INTRODUCTION
============

This directory contains source code related to RepeatScout, which is
described in detail in the following paper:

  Price A.L., Jones N.C. and Pevzner P.A. 2005. De novo identification
  of repeat families in large genomes. To appear in Proceedings of the
  13th Annual International Conference on Intelligent Systems for
  Molecular Biology (ISMB-05). Detroit, Michigan.

The purpose of the RepeatScout software is to identify repeat family
sequences from genomes where hand-curated repeat databases (such as
RepBase Update) are not available. The output of this program can be
used as input to RepeatMasker as a way of automatically masking newly
sequenced genomes.

Included in this package is a more or less arbitrary 3 Mb of the human
X chromosome for your testing and debugging purposes.

INSTALLATION
============

To build the software, you need a standard flavor of "make" and a C
compiler.

To filter your repeat library to remove low-complexity sequences, you
will need to have perl 5.5 (or later; see http://www.perl.com), nseg
(Wootton and Federhen, 1993; see ftp://ftp.ncbi.nih.gov/pub/seg/nseg)
and trf (Benson, 1999; see http://tandem.bu.edu/trf/trf.html).

To filter your repeat library against segmental duplications, exons, or
other features, you will need to have RepeatMasker-open3.0 (or later;
see http://www.repeatmasker.org) as well as the features of interest in
GFF format.

We have verified that this source code compiles and runs correctly on
Linux, FreeBSD, Mac OS X, and DEC Tru64. It should work on any *nix
system and might work on Windows computers if compiled with gcc (under
Cygwin). If you have any experience in compiling this software for
Windows, please let us know.

To build the RepeatScout software, follow these steps:

1) download the source code tarball RepeatScout_1.0.0.tar.gz from
   http://repeatscout.bioprojects.org
2) gunzip and untar it (e.g., tar -xzvf RepeatScout_1.0.0.tar.gz).
   A directory named RepeatScout-1.0.0 will be created.
3) "cd RepeatScout-1.0.0"
4) "make"

The software should build with no errors or warnings. Two programs will
be built: build_lmer_table and RepeatScout-v1. You may leave these
binaries where they are or copy them to any other location you deem
appropriate; no external libraries are needed.

RUNNING THE SOFTWARE
====================

Running RepeatScout proceeds in four phases. First, build_lmer_table
creates a file that tabulates the frequency of all l-mers in the
sequence to be analyzed. Second, RepeatScout-v1 takes this table and the
sequence and produces a fasta file that contains all the repetitive
elements that it could find. Third, the "filter-stage-1.prl" script is
run on the output of RepeatScout-v1 to remove low-complexity elements.
RepeatMasker is then run on the sequence of interest using this filtered
RepeatScout-v1 library, and the program "filter-stage-2.prl" filters out
any repeat element that does not appear a certain number of times (by
default, 10). Finally, the locations of the repeats found by
RepeatMasker are used, in conjunction with GFF files that describe
segmental duplications, exons, or other such "uninteresting" regions, to
remove sequences from the library that are unlikely to be mobile
elements; the program "compare-out-to-gff.prl" does exactly this.

The RepeatScout-v1 program requires a substantial amount of memory and
a fair amount of time. On the human X chromosome, it requires
approximately 8 hours on a 3 GHz PC while using 1.6 GB of memory. We are
currently working on a way to decrease the memory usage of the program
so that it can run on much larger sequences (whole genomes) in a
reasonable amount of time.

You can see a list of command line parameters for each program by
calling the program with the "--h" flag.

Parameter Choices
-----------------

The only parameter that we recommend the user adjust is l, the length
of the l-mer seeds.
We have found ceil(log_4(L)+1) to be a suitable choice, where

  ceil(x)  = smallest integer greater than or equal to x
  log_4(x) = log base 4 of x
  L        = the length of the input sequence

See the help output of the RepeatScout program (--h) for a list of
other tunable parameters.

References:

Benson G. 1999. Tandem repeats finder -- a program to analyze DNA
sequences. Nucleic Acids Res. 27:573--80.

Wootton J.C. and Federhen S. 1993. Statistics of local complexity in
amino acid sequences and sequence databases. Computers and Chemistry
17:149--63.
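
Example: computing l
--------------------

As a worked illustration of the l-mer length rule given under Parameter
Choices, the following Python sketch computes ceil(log_4(L)+1) for a
sequence of length L. It is not part of the RepeatScout package: the
function name recommended_lmer_length is hypothetical, ceil is assumed
to be the standard ceiling function, and the logarithm is evaluated with
integer arithmetic so that exact powers of 4 are not affected by
floating-point rounding.

```python
def recommended_lmer_length(seq_len: int) -> int:
    """Return ceil(log_4(seq_len) + 1) for a sequence of seq_len bases.

    Hypothetical helper, not shipped with RepeatScout. It finds the
    smallest integer k with 4**k >= seq_len, which equals
    ceil(log_4(seq_len)), and then adds 1.
    """
    if seq_len < 1:
        raise ValueError("sequence length must be a positive integer")
    k = 0
    power = 1  # invariant: power == 4**k
    while power < seq_len:
        power *= 4
        k += 1
    return k + 1


if __name__ == "__main__":
    # Assumption: the bundled test sequence is about 3,000,000 bases long.
    print(recommended_lmer_length(3_000_000))  # prints 12
```

Any sequence length above 4^10 = 1,048,576 and up to 4^11 = 4,194,304
gives l = 12 under this rule, so the roughly 3 Mb test sequence falls in
that range. See the --h output of each program for how to supply l.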