Computational motif discovery (Bioinformatics)

 

Given the vast amounts of sequence data being generated by numerous genome and proteome projects, on which regions should biologists focus their attention? The goal of computational motif discovery is to predict relatively short subsequences that are good candidates to serve some biological function. This article focuses on the computational prediction of protein-binding sites in nucleotide sequences (for instance, transcription factor binding sites in DNA promoter regions) as a specific example. However, most of the ideas carry over to other applications such as predicting protein domains, splice sites, and so on.

The binding sites to be discovered are typically quite short, between 6 and 25 nucleotides in length. If one tabulated the binding sites of a single DNA-binding protein, one would find that they are not all identical because the protein typically tolerates some variation in the sequences it binds. Instead, a typical collection of binding sites for a single protein might look something like the hypothetical example shown in Figure 1, in which some positions are well conserved (for instance, C in position 2 of each binding site) and other positions less conserved (for instance, position 1).

It is this variation among binding sites that makes computational motif discovery challenging. There are many different motif “models” that can be adopted in an attempt to characterize the variation. One of the simplest such models is the consensus string. For instance, the eight binding sites in Figure 1 can be described as being near the consensus ACGTCGT, allowing up to two nucleotide substitutions from this consensus. A second simple motif model is the IUPAC description. Each of the binding sites in Figure 1 is an instance of the IUPAC string RCGWYGW, where R stands for either of the purines A or G, Y stands for either of the pyrimidines C or T, and W stands for either of the weakly binding nucleotides A or T.

Both the consensus and IUPAC motif models have the advantage that motif discovery algorithms can systematically enumerate all possible motifs, since a motif is characterized by a single consensus or IUPAC string. Along with this advantage comes a disadvantage that these models do not have the nuanced expressiveness that sometimes is needed to describe a collection of binding sites accurately. Possibly, a more realistic motif model is the weight matrix, sometimes called position-specific scoring matrix. An example of a weight matrix for the binding sites of Figure 1 is shown in Figure 2. This matrix has one column for each of the seven nucleotide positions of the binding site. Column p assigns scores to each of the nucleotides that might appear in position p of a binding site. The higher the total score over all seven positions, the closer a potential binding site is to the collection of binding sites in Figure 1. For instance, the aforementioned consensus string ACGTCGT has a score of 1.2 + 1.9 + 1.9 + 0.9 + 0.9 + 1.9 + 1.2 = 9.9, the highest possible score for this matrix.

Eight hypothetical binding sites

Figure 1 Eight hypothetical binding sites


A

1.2

-3.2

-3.2

0.9

-3.2

-3.2

0.5

C

-3.2

1.9

-3.2

-3.2

0.9

-3.2

-3.2

G

0.5

-3.2

1.9

-3.2

-3.2

1.9

-3.2

T

-3.2

-3.2

-3.2

0.9

0.9

-3.2

1.2

Figure 2 Log likelihood ratio weight matrix for example of Figure 1

Once a motif model has been selected, the next question is the type of motif discovery application. The input nucleotide sequences might be genes from a single genome that are each suspected to contain binding sites of a single protein, perhaps collected from knockout or gene expression experiments. Alternatively, the sequences might be homologous genes, typically one each from multiple related genomes. These two applications are distinguished by the names statistical over-representation and phylogenetic footprinting, respectively.

In the case of statistical overrepresentation, three different types of motif discovery algorithms have proved successful. The first type is enumerative, using either the consensus or IUPAC motif model enumeratively as outlined above. Examples of this approach are the programs RSAT (van Helden et al., 2000) and YMF (Sinha and Tompa, 2002). A second type of algorithm might be described as iterative improvement. These typically use the weight matrix model and alternate between (1) improving the current set of binding sites, given the current weight matrix, and (2) improving the current weight matrix, given the current set of binding sites. Successful programs that use this approach are the Gibbs sampler (Lawrence et al., 1993) and MEME (Bailey and Elkan, 1995). The third approach also uses the weight matrix model and what computer scientists call the greedy method, progressively adding to the current set of binding sites that short sequence that is closest to them. This approach is exemplified by the program Consensus (Hertz and Stormo, 1999). Which of these approaches is most accurate depends to a large extent on which motif model most accurately reflects the biological sites sought.

In the case of phylogenetic footprinting, there have been two approaches used. In the traditional approach, the entire homologous sequences are first globally aligned using a multiple alignment program such as Clustal W (Thompson et al., 1994) or Dialign (Morgenstern, 1999). From this global multiple alignment, one then selects short sequences of contiguous columns that show unusually high conservation, by some measure closely related to one of the motif models discussed above. The disadvantage of this approach is that, if the sequences are too diverged, the short conserved binding sites are overwhelmed by the “noise” of the surrounding random sequence, and the alignment tool, in an attempt to align all the noise well, will fail to align the desired binding sites. To avoid this problem, the program FootPrinter (Blanchette and Tompa, 2002) was designed to search directly for well conserved short motifs without resorting to global multiple alignment at all. FootPrinter uses the phylogeny relating the sequences to guide this search.

The References section provides websites, where all of these motif discovery programs are available to users.

Next post:

Previous post: