Divide and Conquer Strategies for Protein Structure Prediction - Mathematical Approaches to Polymer Sequence Analysis and Related Problems


reasonable accuracy by clustering the sequences in an aligned family, and assessing the degree of sequence variability observed between very similar pairs. Later, this idea was exploited by Rost and Sander, who showed that it was possible to improve the accuracy of the prediction of secondary structure and solvent accessibility by introducing evolutionary information, in the form of sequence profiles, as input to neural networks [42].

Unlike an MSA, whose dimension increases linearly with the number of aligned sequences, a sequence profile of a protein is a matrix P whose columns represent the sequence positions and whose rows correspond to the 20 possible residue symbols. The profile matrix P is computed from an MSA and is defined relative to a specific sequence of interest p. Each element P_ai of the sequence profile represents the normalised frequency of the residue type a in the aligned position i. In practice, given an MSA that contains the sequence of interest p, we derive column i of the corresponding profile by computing the frequency of occurrence of each residue in the column of the MSA corresponding to the i-th residue of p. In this way, the dimension of a profile P depends only on the length of p and not on the number of aligned sequences, so that it becomes easy to use fragments of the matrix P as input for machine learning methods.
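The construction described above can be illustrated with a minimal Python sketch. It assumes the MSA is given as a list of equal-length strings with "-" as the gap symbol; the function name `sequence_profile`, the `query_index` parameter, and the choice to ignore gaps and non-standard symbols when normalising are assumptions made for this illustration, not part of the original text.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residue symbols


def sequence_profile(msa, query_index=0):
    """Return the profile as a dict: residue -> list of frequencies,
    one entry per residue of the query sequence (hypothetical helper)."""
    query = msa[query_index]
    if any(len(row) != len(query) for row in msa):
        raise ValueError("all rows of the MSA must have the same length")

    # Keep only the MSA columns in which the query has a residue, not a gap.
    columns = [c for c, symbol in enumerate(query) if symbol != "-"]

    profile = {a: [] for a in AMINO_ACIDS}
    for c in columns:
        # Assumption: gaps and non-standard symbols are not counted.
        counts = Counter(row[c] for row in msa if row[c] in AMINO_ACIDS)
        total = sum(counts.values())
        for a in AMINO_ACIDS:
            profile[a].append(counts[a] / total if total else 0.0)
    return profile


if __name__ == "__main__":
    toy_msa = ["AC-D", "ACGD", "GC-E"]  # the first row is the query p
    P = sequence_profile(toy_msa)
    for a in "ACDEG":
        print(a, [round(x, 2) for x in P[a]])
```

In this toy alignment the query has three residues, so the profile has three columns however many sequences are aligned; adding more rows to `toy_msa` changes the frequencies but not the shape of P.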

The computation of an MSA for a query sequence is a complex process, both in terms of time and of the care required. It consists of two steps. First, the query sequence is searched against a non-redundant dataset of protein sequences in order to select a set of chains that are similar to the query. Several optimal and near-optimal pairwise alignment algorithms can perform such searches. Currently, the heuristic Basic Local Alignment Search Tool (BLAST) [2] is considered the de facto standard software for pairwise sequence comparison. Although exact algorithms are available for pairwise sequence comparison, the heuristic BLAST is the most widely used because of its speed (non-redundant datasets can contain millions of different protein sequences) and its good performance compared to exact algorithms. The selection of similar sequences must be performed carefully in order to avoid introducing meaningless sequences into the MSA, such as sequences with low complexity regions. Low complexity regions are sequence segments of very non-random composition (“simple sequences,” “compositionally-biased regions”). They are abundant in natural sequences and may produce high-scoring matching segments between unrelated protein sequences. To avoid this problem, BLAST implements a filter procedure based on the SEG [49] software. SEG provides a measure of the compositional complexity of a sequence segment and divides sequences into contrasting segments of low complexity and high complexity. Typically, globular domains have higher sequence complexity than fibrillar or conformationally disordered protein segments. When used in BLAST, SEG replaces the low complexity regions within the input sequence with X's to prevent spurious matching with unrelated sequences.
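The idea of masking low complexity regions can be sketched with a much simplified stand-in for SEG. The code below is not the SEG algorithm: it only slides a fixed-length window along the sequence, uses the Shannon entropy of the window's residue composition as a complexity measure, and replaces every residue of a low-entropy window with X. The function names, the window length of 12, and the threshold of 2.2 bits are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (in bits) of the residue composition of a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


def mask_low_complexity(sequence, window=12, threshold=2.2):
    """Replace with 'X' every residue covered by a window whose entropy
    falls below the threshold (hypothetical, simplified filter)."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


if __name__ == "__main__":
    # A toy sequence with a poly-Q stretch between two more varied flanks.
    query = "MKVLAAGIVE" + "Q" * 12 + "DTWPLRHFNC"
    print(query)
    print(mask_low_complexity(query))
```

Because whole windows are masked, a few residues flanking the repeat are also replaced by X in this naive version; SEG itself refines the boundaries of the segments it reports, which this sketch does not attempt.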

When the set of similar sequences has been selected, the second step consists of building an MSA. Unlike the pairwise sequence alignment problem, building an optimal multiple alignment is a difficult task that is not computable in reasonable time. Several software implementations of heuristic algorithms for MSA
