Genomes in general, and eukaryotic genomes in particular, are rife with segments of repetitive sequence. Repetition is the most obvious pattern found in genetic information, and is usually indicative of a biologically significant motif or landmark in that particular biomolecule. Repetitive segments of DNA appear to be necessary for the structural function of centromeres and telomeres (Louis, 2002; Schueler et al., 2001). Gene duplications also lead to repetitive motifs, such as those found in biologically important regions such as ribosomal RNA genes (Heilig et al., 2003) or the human Y chromosome (Repping etal., 2002), reflecting the evolutionary importance of these areas. Consequently, some of the earliest analytical tools developed involve detection, identification, and classification of repeat regions in protein and DNA sequences. With the availability of whole genomes, such repeat finding algorithms are an essential aspect of genome level analysis and annotation. In fact, repetitive regions often confound the assembly stage of genome sequencing, so the successful assembly of whole genomes is itself dependent on identifying and masking repeats.
Although identifying repeats in biological sequences would, on its face, appear to be a mathematically simple problem of searching for repeated substrings within a larger string, several factors confound the issue. Biological repeat segments are not necessarily exact repeats, and the periodicity of the pattern seldom matches an exact formula. Despite these fuzzy boundaries, biological repeats retain a physiological significance, and therefore, the algorithms written have accommodated these inexact specifications. In addition, comparisons of repeat motifs within and among whole genomes demand logarithmically increasing amounts of memory and processing power; most recent implementations of repeat finding software utilize more sophisticated searching algorithms such as suffix-tree mapping to address these issues (Delcher et al., 1999).
In general, repeat analysis software packages have focused on one or more of the following tasks: detection of repeat patterns (which can be subtle), de novo identification of basic repeat units (Bao and Eddy, 2002), clustering and classification of known repeat units (Volfovsky et al., 2001), and curation of such repeats. In this section, we will take a survey of the different software implementations for repeat finding.
Detection of repeated segments in strings of characters is a difficult problem in computation, particularly in biology where the basic unit of repetition can vary not only in sequence but also in size and displacement. The human eye, however, is adept at picking out patterns laid out in a two-dimensional figure; thus, one of the simplest and yet most effective means of detecting repeated patterns in sequences is the dot-matrix plot (Gibbs and McIntyre, 1970). At the simplest implementation of this plot, two sequences that are being compared are laid out along the two axes of a two-dimensional grid. For every character in common between the two strings, one dot is placed at the intersection of the two coordinates. While this approach can be used to visualize the commonalities between any two strings, when the same string is laid on both axes, repeated motifs become evident to the human eye (Figure 1). A central diagonal is always present as the sequence is compared against itself, direct repeats manifest as parallel lines to the central diagonal, and inverted repeats are present as perpendicular lines. The distance from the diagonal indicates displacement, and palindromic repeats cross the central diagonal. This approach has proved to be so effective in detecting repeat motifs that despite increasing sophistication in implementation, the basic algorithm has remained essentially unchanged.
The earliest implementation as a software package was described by Pustell and Kafatos (1982), and implementations are found in various software packages ranging from commercial to open source; one such example is dottup, a component of the EMBOSS package (Rice et al., 2000). Since a dot-matrix plot relies on the pattern recognition skills of the human eye, good interactivity and flexibility with the manipulation of the dot-matrix image is key to a good implementation. While lines can denote significant repeat structures, nonspecific or low complexity matches will be displayed as clouds of dots that can occlude the visibility of significant sequence features. Perhaps the most sophisticated interface for exploring dot-matrix plots currently available is in the dotter package (Sonnhammer and Durbin, 1995). Relying on the X11 graphical user layer in Unix, the dotter package plots the images as a gray-scale plot, and provides a gray ramping tool that permits on the fly alterations to the visualization threshold. Dynamic zooming into segments, and the ability to export to high-resolution PostScript data completes the package.
Figure 1 The dotter interface. Loaded here is one arm of the Dictyostelium discoideum extrachromosomal rDNA element, lines parallel to the central diagonal indicate direct repeats in the sequence. Clusters of shorter repeats are seen as multiple parallel lines. Note the greyramp tool at the lower left corner that can be used to set the threshold of the image to improve detection of shorter repeat motifs
While dotter theoretically presents no real limit to the amount of data that it can process, realistically, as the input data increases (such as that of a chromosome-sized piece of DNA), the dotter package slows down dramatically. Much of the performance bottleneck is in the matching algorithm used to align segments to the genome. Word-search-based algorithms that rely on preindexing a table of all possible word combinations in a sequence are far faster in traversing large sequences, and is the method implemented by the software package lbdot (Huang and Zhang, 2004). Although the interface is less polished for exploring the resulting plot, it is able to process large sequences at a fraction of the time and memory required by dotter.
Inherent to the problem of detecting repeats is the need to align and search whole genomes efficiently. The REPuter package efficiently searches for repeats and palindromes in genome scale sequences. Initially implemented just to detect the largest exact matches for the substrings, it has since been improved to permit the presence of degenerate repeats with allowable mismatches (Kurtz et al., 2001). It is available in limited form for Web-based interaction, but a licensed stand-alone version provides increased flexibility. An approach to aligning whole genomes that uses the suffix-tree algorithm to detect repeats is the MUMmer package from The Institute of Genome Research (TIGR) (Kurtz etal., 2004). Now at version 3, the MUMmer package is uniquely flexible in not only aligning any genome (or fragment thereof) with any other, but can base the alignment on the protein coding potential of the genome, detecting duplicated genes that may have diverged sufficiently on the nucleotide level to avoid detection with a direct dot-matrix plot (Figure 2). The MUMmer package is fast enough that an on-line version is available for arbitrary alignments between the different completed microbial genomes archived at the Comprehensive Microbial Resource at TIGR (Peterson et al., 2001). More recently, a new Repeat Analysis Program has become available that improves on these approaches by indexing gapped words, increasing the sensitivity of the word-searching method (Campagna et al., 2005).
Eukaryotic genomes carry not only a load of simple and tandem repeats but also a number of interspersed repeats, particularly those deriving from transposable elements and their by-products. Although simple repeats are easy to identify, repeated de novo identification of more complex repeats through word-finding can be impractical for every new genome that needs to be analyzed, or at the very least inefficient for genomes that are closely related to a previously analyzed one. In these circumstances, repeat finding can be accelerated by the use of a preexisting library of curated known repeat sequences.
Figure 2 MUMmer output. This illustrates MUMmer displaying the E. coli genome against itself, highlighting the presence of short inverted repeats. For this display, the minimum sensitivity is 100 bp. A red dot indicates an alignment in the same direction, and a green dot for the opposite direction. The multicolored bars on each side show the gene density for the chromosome, each colored bar representing one gene
Although repeated sequence motifs can be indicative of biologically important regions of genomes, the sheer abundance of interspersed complex repetitive sequences can obscure annotation and analysis efforts, particularly if a particular element encodes potential coding sequences that can interfere with ab initio gene prediction software. Thus, masking out the repeats is an equally important subsequent step to repeat detection. The most common software package for this purpose is RepeatMasker, now an open source package housed by the Institute for Systems Biology (Smit et al., 1996-2004). At its core, RepeatMasker relies on a database of known repeat sequences, and the matching algorithm cross_ match (Gordon et al., 1998). However, at the highest sensitivity settings, cross_ match can be very slow with large sequences, leading to the development of the accelerator package MaskerAid (Bedell et al., 2000), which replaced the matching algorithm with the significantly faster WU-BLAST algorithm. The current implementation of RepeatMasker integrates the option to use the WU-BLAST method.
Databases of prototypic sequences from repetitive elements are key to the operation of such masking software. Although the on-line resource RepBase was originally crafted to curate prototypic sequences from the human genome, it has since expanded to accommodate most known eukaryotic repeats, and releases them in database form for use by programs such as RepeatMasker (Jurka et al., 1992). Managed by the nonprofit Genetic Information Research Institute, RepBase also electronically publishes a specialized electronic journal dedicated to studying eukaryotic repeats called RepBase Reports. The RepBase team also implements an alternative masking program called CENSOR (Jurka etal., 1996). Housed in Taiwan, the Repeat Sequence Database is another such database, originally designed to explore the relationship between repetitive sequence data and transcriptional activation sites (Horng et al., 2003).
As with most software implementations, the packages described in this section continue to evolve, and will no doubt be supplemented and supplanted by new software in the future. Table 1 summarizes the different URLs where the software is available, and all are either open source or available for free for academic use at the time of this writing.