IMPALA/RPS-BLAST/PSI-BLAST in protein sequence analysis (Bioinformatics)

1. Introduction: philosophy of profile-based analysis

Sequence evolution is a largely stochastic process. Random, undirected mutations occur during DNA replication. Depending on their effect on the structure and function of the protein, they may become fixed in the population. According to the neutral theory of molecular evolution (Kimura, 1983), the majority of fixed mutations are effectively neutral. As two proteins diverge, following either speciation or gene duplication within the same genome, they accumulate substitutions, and their similarity gradually declines over time. The random, mostly neutral character of protein evolution makes sequence convergence of unrelated proteins practically impossible.

Given a reasonable mathematical model of amino acid composition and molecular evolution of proteins, one can obtain precise quantitative estimates of several important measures. In particular, the distribution of sequence similarity measures (fraction of identical amino acids or alignment score) for random (i.e., unrelated) sequences can be computed (Karlin and Altschul, 1990). If a pair of sequences produces an alignment whose similarity exceeds random expectations, this is a strong indication that the proteins are homologous. However, if the proteins diverged long ago, their similarity drops to a level where it is indistinguishable from statistical “noise”. For such proteins, one cannot justify a claim of homology on the basis of their alignment score alone. The level of similarity at which confident inference of homology becomes impossible is often referred to as the “twilight zone” (Doolittle, 1981).
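As a rough illustration of how such random expectations are used in practice, the sketch below applies the standard Karlin-Altschul relation E = K·m·n·exp(−λS) for local alignment scores. The function names are hypothetical, and the values of λ and K are illustrative assumptions (figures commonly quoted for gapped BLOSUM62 searches), not parameters taken from this article.

```python
import math

# Assumed, illustrative statistical parameters; real values depend on the
# scoring matrix, the gap costs, and the amino acid composition.
LAMBDA = 0.267
K = 0.041


def expected_chance_hits(raw_score, query_len, db_len, lam=LAMBDA, k=K):
    """Expected number of unrelated alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam=LAMBDA, k=K):
    """Normalized score that no longer depends on the scoring system."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def p_value(e_value):
    """Probability of seeing at least one chance hit, given the E-value."""
    return -math.expm1(-e_value)


if __name__ == "__main__":
    for score in (40, 80, 160):
        e = expected_chance_hits(score, query_len=300, db_len=50_000_000)
        print(score, round(bit_score(score), 1), e, p_value(e))
```

An alignment whose E-value is close to or above 1 is statistically indistinguishable from a chance match, which is the quantitative meaning of the “twilight zone”.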


The evolution of proteins is characterized by a remarkable nonuniformity of mutability across sequence sites. Some parts of the polypeptide chain are extremely tolerant of amino acid substitutions, while others require specific residues to preserve the function and/or structure of the protein (Doolittle, 1981). The overall nonuniform distribution of evolutionary rates among sites can be taken into account in sequence evolution models, but there is no general way to account for the properties of particular sites in particular protein families.

Figure 1 (a) A fragment of an alignment of two distantly related homologous sequences (tufB and coaA). (b) A fragment of an alignment of two unrelated sequences (tufB and fimI). Asterisks indicate residues identical between the two sequences

Two pairwise alignments with the same (low) level of sequence similarity are equivalent from a statistical point of view; however, they might have very different biological meanings, readily distinguishable by an expert in the particular group of proteins. Both alignment fragments in Figure 1 have 14-16% identical residues. The relevance of alignment (a) is self-evident to anyone familiar with P-loop ATPases (Saraste et al., 1990), as it correctly juxtaposes important residues of the so-called Walker A motif; alignment (b) is a spurious match of unrelated proteins.

The distinction is easy to draw if the proteins are well studied and the relative importance of different fragments and residues of the polypeptide chain is known. For less-characterized proteins, this approach is, by definition, unproductive.

The difference in mutability of different protein sites is easy to observe by comparing multiple proteins separated by various evolutionary distances. Sites evolving at slower rates are more often seen as amino acid matches in pairwise comparisons; fast-evolving sites do not display conservation beyond the random expectation. Compiling a multiple alignment of distant but unambiguously related proteins allows one to assess the relative importance of different sites. Taking into account the multiple alignments built for each of two proteins that display low sequence identity can help establish whether their similarity is relevant: if the alignment of the two sequences, however poor, largely follows the sites conserved in each of the multiple alignments, then it is likely to be biologically significant (Figure 2a); if the two multiple alignments are discordant, the juxtaposition is probably spurious (Figure 2b). Going from closely related proteins to more and more distant ones, one can learn the features specific to a given level of evolutionary relationship and extend the range of confident biological conclusions far into the “twilight zone” (Doolittle, 1981).

This observation can be formalized by introducing the notion of a sequence profile (Gribskov et al., 1987). Each position in a profile corresponds to a site in a protein; but, unlike a plain protein sequence, each site is represented not by a single amino acid but by a spectrum of possible variants. Highly variable sites are represented by a widely dispersed distribution of amino acids; for conserved sites, the distribution is sharply biased toward a particular amino acid (or class of amino acids, e.g., “aromatic” or “positively charged”). The simplest form of a profile representing an alignment of length L is an L x 20 matrix in which each row corresponds to a site in the alignment, each column corresponds to an amino acid, and each matrix element gives the frequency of a given amino acid at a given alignment position. In practice, instead of raw frequencies, profiles often contain precomputed weights associated with the occurrence of a given amino acid at a given position of the alignment; such a profile is often referred to as a position-specific score matrix (PSSM). The PSI-BLAST (Altschul et al., 1997) and HMMer (Durbin et al., 1998) program families are among the most widely used profile-based sequence analysis software.
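The two forms of a profile described above can be sketched in a few lines. This is a minimal illustration, not the procedure of any particular program: the function names, the uniform background distribution, and the small fixed pseudocount are all assumptions made for the example.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frequency_profile(alignment):
    """Return an L x 20 matrix of amino acid frequencies (gaps are ignored)."""
    profile = []
    for pos in range(len(alignment[0])):
        column = [seq[pos] for seq in alignment if seq[pos] in AMINO_ACIDS]
        profile.append(
            [column.count(aa) / len(column) if column else 0.0
             for aa in AMINO_ACIDS]
        )
    return profile


def log_odds_scores(profile, pseudo=0.05):
    """Turn frequencies into position-specific scores (log2 odds).

    Assumption: a uniform background of 1/20 per amino acid, mixed into the
    observed frequencies with weight `pseudo` so that no score is -infinity.
    """
    background = 1.0 / len(AMINO_ACIDS)
    return [
        [math.log2(((1 - pseudo) * f + pseudo * background) / background)
         for f in row]
        for row in profile
    ]


if __name__ == "__main__":
    toy_alignment = ["GKT", "GKS", "GRT", "GKT"]  # invented sequences
    freqs = frequency_profile(toy_alignment)
    scores = log_odds_scores(freqs)
    for site, row in enumerate(scores):
        best = max(range(20), key=row.__getitem__)
        print(site, AMINO_ACIDS[best], round(row[best], 2))
```

The fully conserved first site receives a strongly positive score for G and negative scores for everything else, while a more variable site spreads smaller positive scores over several residues.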

Figure 2 (a) Alignment of sequences in Figure 1(a) juxtaposed with multiple alignments for tufB and coaA. (b) Alignment of sequences in Figure 1(b) juxtaposed with multiple alignments for tufB and fimI. Asterisks indicate the same pairs of residues as in Figure 1. Alignments are colored according to physicochemical properties of amino acid residues

2. PSI-BLAST: Position-specific Iterated BLAST

Position-specific iterated BLAST was developed at the National Center for Biotechnology Information (NCBI) (Altschul et al., 1997). Its profile-based features dramatically improved sensitivity and specificity of protein sequence similarity search.

PSI-BLAST generalizes the regular BLAST scoring scheme to a comparison between a query PSSM and a single target sequence. The query PSSM is computed from a query-anchored multiple alignment (sometimes referred to as a “master-slave alignment”, with the query in the role of the master sequence). Scores in each column of the PSSM reflect the frequencies of different amino acids in the corresponding alignment position, “diluted” by background amino acid frequencies to dampen the effect of sampling error. The query-target alignment is scored using the PSSM instead of a presupplied scoring matrix, as is done in regular BLAST. To alleviate the effect of many highly similar sequences “outvoting” less abundant divergent forms, the amino acid frequencies used to compute the PSSM are adjusted according to the sequence weighting scheme of Henikoff and Henikoff (1994).
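The weighting and “dilution” steps can be illustrated as follows. The weights follow the position-based idea of Henikoff and Henikoff (1994): within each column, a residue type shared by many sequences contributes less weight to each of them. The dilution shown here is a deliberately simplified assumption (a fixed mixing weight and a uniform background); PSI-BLAST itself uses a more elaborate pseudocount scheme, and all function names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_based_weights(alignment):
    """Position-based sequence weights, normalized to sum to 1."""
    weights = [0.0] * len(alignment)
    for pos in range(len(alignment[0])):
        column = [seq[pos] for seq in alignment]
        counts = Counter(c for c in column if c in AMINO_ACIDS)
        for i, residue in enumerate(column):
            if residue in counts:
                # Each residue type shares 1/(number of types) equally
                # among the sequences that carry it in this column.
                weights[i] += 1.0 / (len(counts) * counts[residue])
    total = sum(weights)
    return [w / total for w in weights]


def diluted_frequencies(alignment, weights, pseudo_weight=0.2):
    """Weighted column frequencies mixed with a uniform background.

    Assumption: fixed mixing weight and uniform background, for illustration.
    """
    background = 1.0 / len(AMINO_ACIDS)
    rows = []
    for pos in range(len(alignment[0])):
        observed = dict.fromkeys(AMINO_ACIDS, 0.0)
        mass = 0.0
        for seq, w in zip(alignment, weights):
            if seq[pos] in observed:
                observed[seq[pos]] += w
                mass += w
        rows.append([
            (1 - pseudo_weight) * (observed[aa] / mass if mass else background)
            + pseudo_weight * background
            for aa in AMINO_ACIDS
        ])
    return rows


if __name__ == "__main__":
    # Three near-identical sequences and one divergent relative (invented).
    toy = ["GKTA", "GKTA", "GKTA", "GRSV"]
    w = position_based_weights(toy)
    print([round(x, 4) for x in w])
    freqs = diluted_frequencies(toy, w)
    print(round(freqs[1][AMINO_ACIDS.index("R")], 3))
```

In this toy case the single divergent sequence receives more weight than any one of the three identical sequences, so its residues are not simply outvoted.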

The primary mode of PSI-BLAST application is an iterative search starting with a single sequence as the query. The first iteration consists of a regular BLAST search against a given database. Sequences found to be confidently similar to the query (i.e., with E-values below a specified threshold) contribute to a query-anchored multiple alignment. Weighted amino acid frequencies and position-specific scores are computed for each alignment position. Well-conserved sites naturally tend to favor a particular amino acid (or a narrow range of amino acids), producing position-specific scores that sharply discriminate between “right” and “wrong” amino acids. Amino acid frequencies in highly variable sites approach the background frequency distribution, resulting in scores uniformly close to zero. In subsequent iterations, target sequences are compared against the query PSSM. Matches or mismatches against sites conserved in the previous iteration are scored in a sharply contrasted manner; variable sites are scored neutrally regardless of the target amino acid. Thus, target sequences that preserve the pattern of conserved sites are scored higher than those that have the same number of matching sites distributed differently. In each iteration, the signal contained in the conserved sites is reinforced by new matches, and the noise in variable sites is further diluted. Highly diverged homologous sequences that scored low in previous iterations usually receive progressively higher scores, rising in the ranked list of hits.
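The effect of position-specific scoring on targets with the same number of matches can be seen in a toy example. The sketch below is ungapped and uses invented sequences, a uniform background, and a fixed pseudocount; these are assumptions for illustration and not the actual PSI-BLAST computation.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background


def build_pssm(alignment, pseudo=0.5):
    """One dict of log2-odds scores per alignment position."""
    pssm = []
    for pos in range(len(alignment[0])):
        column = [seq[pos] for seq in alignment]
        row = {}
        for aa in AMINO_ACIDS:
            freq = column.count(aa) / len(column)
            mixed = (1 - pseudo) * freq + pseudo * BACKGROUND
            row[aa] = math.log2(mixed / BACKGROUND)
        pssm.append(row)
    return pssm


def ungapped_score(pssm, target):
    """Sum of position-specific scores along an ungapped alignment."""
    return sum(row[aa] for row, aa in zip(pssm, target))


def identities(a, b):
    """Number of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b))


# Invented family: the first two sites are conserved, the rest vary.
QUERY = "GKSTAL"
FAMILY = [QUERY, "GKTTVI", "GKSSLM", "GKTSIV"]
PSSM = build_pssm(FAMILY)

PATTERN_KEPT = "GKDDDD"  # matches the query only at the conserved sites
PATTERN_LOST = "DDDDAL"  # matches the query only at variable sites

if __name__ == "__main__":
    for target in (PATTERN_KEPT, PATTERN_LOST):
        print(target, identities(QUERY, target),
              round(ungapped_score(PSSM, target), 2))
```

Both targets share two identical residues with the query, yet the one that preserves the conserved pair scores higher because those positions carry most of the signal in the PSSM.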

The success of a PSI-BLAST search strongly depends on the database content and on the choice of query. If the database contains a great number of sequences related to the query, with gradually declining similarity, there is a good chance that the iterated procedure will eventually retrieve them all. It is especially helpful if the selected query is roughly equidistant from the other homologs and does not contain too many insertions and deletions compared to the majority of its relatives. If there is a wide gap between the immediate relatives of the query and its more distant homologs, the search is likely to converge prematurely after a few iterations (i.e., no new sequences can be recognized as significantly similar because the narrow group of closest relatives does not allow sufficient discrimination between important and irrelevant sites). Another problem with PSI-BLAST searches is the potential for “PSSM explosion”. If an unrelated sequence makes its way into the profile (usually because of its compositional bias), it may influence the scores strongly enough that, at the next iteration, its own relatives also score high. Subsequent iterations then bring in more hits unrelated to the original query. This often leads to a widespread “flattening” of the profile, with loss of distinction between conserved and variable sites, further reducing the search specificity.
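The control flow of the iterated search, including the convergence condition, can be sketched as a simple loop. Everything here is hypothetical scaffolding: `search` and `build_profile` stand in for the database search and PSSM construction, the threshold value is an arbitrary assumption, and accumulating the included set across iterations is a simplification.

```python
def iterate_search(query, search, build_profile,
                   inclusion_threshold=0.005, max_iterations=5):
    """Run an iterated profile search until no new sequences are included.

    search(model) must return {sequence_id: e_value}; the model is the bare
    query in the first iteration and a profile afterwards.
    build_profile(query, included_ids) must return the next profile.
    Returns (included_ids, iterations_run, converged).
    """
    included = set()
    model = query
    for iteration in range(1, max_iterations + 1):
        hits = search(model)
        passing = {sid for sid, e in hits.items() if e <= inclusion_threshold}
        new = passing - included
        if not new:
            # Nothing new passed the threshold: the search has converged.
            return included, iteration, True
        included |= new
        model = build_profile(query, included)
    return included, max_iterations, False


if __name__ == "__main__":
    # Toy stand-ins: sequence "b" becomes significant only once "a" has
    # contributed to the profile; "c" never does.
    def toy_search(model):
        if isinstance(model, str):
            return {"a": 1e-10, "b": 0.5, "c": 5.0}
        return {"a": 1e-20, "b": 1e-4, "c": 2.0}

    def toy_profile(query, included_ids):
        return frozenset(included_ids)

    print(iterate_search("QUERYSEQ", toy_search, toy_profile))
```

A “PSSM explosion” corresponds to the opposite failure of this loop: the included set keeps growing with unrelated sequences instead of stabilizing, which is why inspecting what enters the profile at each iteration matters.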

The PSSM constructed in the course of the iterated search can be saved and later reused, provided that the query and the search parameters affecting the statistical calculations remain the same. If the new search is performed against the same database, reuse simply bypasses the previous (often time-consuming) iterations. The PSSM, however, can also be used to search a different database. Typically, one constructs family-specific profiles by iterative searches against a sequence database containing the maximum available diversity of sequences. These profiles are later used for searches in particular genomes or other specialized datasets. Protein PSSM queries can also be used in a search against a nucleotide sequence database, the latter being dynamically translated in all six frames.
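As a hedged illustration of the save-and-reuse workflow, the sketch below assembles command lines for the `psiblast` program from the NCBI BLAST+ package. BLAST+ is an assumption here: it is a later distribution than the stand-alone programs this article describes, and the flags shown should be checked against the help output of the installed version. File and database names are placeholders.

```python
import shlex


def build_profile_command(query_fasta, database, pssm_path, iterations=3):
    """Iterated search against a broad database, saving the final PSSM."""
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", database,
        "-num_iterations", str(iterations),
        "-out_pssm", pssm_path,
    ]


def reuse_profile_command(pssm_path, database, e_value=0.01):
    """Restart from a saved PSSM, typically against a different database."""
    return [
        "psiblast",
        "-in_pssm", pssm_path,  # replaces -query; the PSSM carries the query
        "-db", database,
        "-evalue", str(e_value),
    ]


if __name__ == "__main__":
    # Placeholder names; nothing is executed, the commands are only printed.
    print(shlex.join(build_profile_command("query.fasta", "nr", "family.pssm")))
    print(shlex.join(reuse_profile_command("family.pssm", "my_genome")))
    # To actually run one: subprocess.run(command, check=True)
```

Building the argument list separately from running it keeps the two stages (profile construction against a diverse database, then reuse against a specialized one) explicit and easy to log.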

PSI-BLAST can also initiate a search with a PSSM constructed from a multiple sequence alignment. One of the alignment sequences (with gap characters removed) must be designated as the master query; thus many different, although similar, PSSMs can be obtained from the same multiple alignment. The advantage of this approach is that a high-quality, expert-curated alignment can be used to build the profile.

Recent versions of PSI-BLAST further adjust the scoring matrices according to the amino acid frequency bias in the query (composition-based statistics). In practice, this feature greatly increases search specificity at the price of a certain reduction in sensitivity; this option is especially attractive for fully automated projects.

PSI-BLAST is distributed as a stand-alone program that can be used for searches against a local database. The Web-based PSI-BLAST service provided by the National Center for Biotechnology Information offers an important additional feature: direct user control over the sequences that contribute to PSSM construction. By checking and unchecking control boxes, the user can include or exclude sequences regardless of their formally computed statistical significance. Allowing expert opinion to influence the PSSM composition gives the approach much greater flexibility.

3. IMPALA: Integrating Matrix Profiles and Local Alignments

The accumulation of a collection of high-quality profiles poses the technical challenge of reversing the profile search scenario. Instead of using PSSM queries in a search against a sequence database, it is often practical to run a single-sequence query (or a series of them) against a library of profiles. IMPALA, released in 1999, provided this capability (Schaffer et al., 1999). A collection of PSI-BLAST profiles and the corresponding master query sequences needs to be processed into a searchable library for use with IMPALA. Unlike the BLAST family of programs, which use the empirical “X-dropoff” algorithm to produce and score a local alignment between the query and the target (Altschul et al., 1997), IMPALA implements the rigorous Smith-Waterman algorithm (Smith and Waterman, 1981), producing provably optimal results. This, of course, trades execution speed for accuracy. Additionally, IMPALA included several advanced techniques of PSSM handling (notably, finer scaling of the scores) that were later incorporated into mainstream BLAST software. IMPALA is distributed as a stand-alone suite of programs.
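The reversed scenario, one sequence scanned against a library of profiles with rigorous local alignment, can be sketched as follows. This is a minimal Smith-Waterman recurrence with a linear gap penalty, an assumption made for brevity (production tools use affine gap costs and scaled scores); the toy profiles, the default mismatch score, and all names are invented for the example.

```python
MISMATCH = -2  # assumed score for residues not listed in a profile row


def local_profile_score(pssm, sequence, gap=3):
    """Best local alignment score of a sequence against a PSSM.

    pssm is a list of {residue: score} dicts, one per profile position.
    """
    rows, cols = len(pssm), len(sequence)
    H = [[0] * (cols + 1) for _ in range(rows + 1)]
    best = 0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            diagonal = H[i - 1][j - 1] + pssm[i - 1].get(sequence[j - 1],
                                                         MISMATCH)
            H[i][j] = max(0,                 # start a new local alignment
                          diagonal,          # align residue to profile site
                          H[i - 1][j] - gap,  # skip a profile position
                          H[i][j - 1] - gap)  # skip a sequence residue
            best = max(best, H[i][j])
    return best


def search_library(library, sequence, gap=3):
    """Rank every profile in the library by its score against the query."""
    scored = [(local_profile_score(pssm, sequence, gap), name)
              for name, pssm in library.items()]
    return sorted(scored, reverse=True)


# Invented toy library of two short motif profiles.
LIBRARY = {
    "motif_GKT": [{"G": 6}, {"K": 5, "R": 2}, {"T": 4, "S": 4}],
    "motif_DEW": [{"D": 5, "E": 3}, {"E": 5, "D": 3}, {"W": 8}],
}

if __name__ == "__main__":
    print(search_library(LIBRARY, "AAGKTAA"))
    print(local_profile_score(LIBRARY["motif_GKT"], "AAGAKTAA"))
```

Filling the whole dynamic-programming matrix guarantees the optimal local score for the chosen scoring system, at a cost proportional to profile length times sequence length for every profile in the library, which is the speed trade-off mentioned above.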

4. RPS-BLAST: Reverse PSI-BLAST

RPS-BLAST (Marchler-Bauer et al., 2002) is a direct implementation of the BLAST search algorithms in a reversed manner, with a protein sequence as the query and a PSSM library as the target. As with IMPALA, PSI-BLAST profiles are postprocessed into a searchable database for use with RPS-BLAST. RPS-BLAST has a significant advantage in running speed over IMPALA, with only a minor loss of sensitivity. RPS-BLAST is the search engine of the CD-search service (Marchler-Bauer et al., 2002) of the National Center for Biotechnology Information, available through a Web interface.
