Biomedical Engineering Reference
In-Depth Information
reasonable accuracy by clustering the sequences in an aligned family, and assessing
the degree of sequence variability observed between very similar pairs. Later, this
idea was exploited by Rost and Sander, who showed that the accuracy of secondary
structure and solvent accessibility prediction could be improved by introducing
evolutionary information, in the form of sequence profiles, as input to neural
networks [42].
Unlike an MSA, whose size grows linearly with the number of aligned sequences,
a sequence profile of a protein is a matrix P whose columns represent the
sequence positions and whose rows are the 20 possible residue symbols. The
profile matrix P is computed from an MSA and is relative to a specific sequence
of interest p. Each element P_ai of the sequence profile is the normalised
frequency of residue type a at aligned position i. In practice, given an MSA
that contains the sequence of interest p, column i of the corresponding profile
is derived by computing the frequency of occurrence of each residue in the MSA
column aligned to the ith residue of p. In this way, the size of a profile P
does not depend on the number of aligned sequences, so fragments of the matrix P
are easy to use as input for machine learning methods.
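The profile computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any specific tool: the function names `sequence_profile` and `profile_window`, the row order in `AMINO_ACIDS`, the zero padding at the sequence ends, and the toy alignment are all assumptions made for the example. It also assumes that the sequence of interest p contains only the 20 standard residue symbols and that gaps are written as `-`.

```python
from collections import Counter

# The 20 residue symbols; the row order is an arbitrary choice for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_profile(msa, query_index=0):
    """Return P as a list of 20 rows, where P[a][i] is the normalised
    frequency of residue AMINO_ACIDS[a] in the MSA column aligned to the
    i-th residue of the sequence of interest (msa[query_index])."""
    query = msa[query_index]
    # Keep only the alignment columns where the query has a residue (no gap).
    columns = [c for c, symbol in enumerate(query) if symbol in AMINO_ACIDS]
    profile = [[0.0] * len(columns) for _ in AMINO_ACIDS]
    for i, c in enumerate(columns):
        # Count residues in this column, ignoring gaps and unknown symbols.
        counts = Counter(row[c] for row in msa if row[c] in AMINO_ACIDS)
        total = sum(counts.values())
        for a, residue in enumerate(AMINO_ACIDS):
            profile[a][i] = counts[residue] / total
    return profile


def profile_window(profile, centre, half_width):
    """Flatten the profile columns centre-half_width .. centre+half_width
    into one feature vector, padding with zeros beyond the sequence ends."""
    length = len(profile[0])
    features = []
    for i in range(centre - half_width, centre + half_width + 1):
        for a in range(len(profile)):
            features.append(profile[a][i] if 0 <= i < length else 0.0)
    return features


# Toy alignment (hypothetical data); the first row is the sequence of interest p.
toy_msa = [
    "MKV-LA",
    "MRVALA",
    "MKIALG",
    "MKV-LA",
]
P = sequence_profile(toy_msa)
window = profile_window(P, centre=0, half_width=1)
```

Whatever the number of rows in `toy_msa`, `P` always has 20 rows and one column per residue of p, so a window of fixed width yields a feature vector of fixed length.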
Computing an MSA for a query sequence is demanding both in terms of time and of
the care required. It consists of two steps. First, the query sequence is
searched against a non-redundant dataset of protein sequences in order to select
a set of chains that are similar to the query. Several optimal and near-optimal
pairwise-alignment algorithms can perform such searches. Currently, the
heuristic basic local alignment search tool (BLAST) [2] is considered the de
facto standard software for pairwise sequence comparison. Although exact
algorithms are available for pairwise sequence comparison, the heuristic BLAST
is the most widely used because of its speed (non-redundant datasets can contain
millions of different protein sequences) and its good performance compared to
exact algorithms. The selection of similar sequences must be performed carefully
in order to avoid introducing meaningless sequences into the MSA, such as
sequences with low complexity regions. Low complexity regions are sequence
segments of highly non-random composition (“simple sequences,”
“compositionally-biased regions”). They are abundant in natural sequences and
may produce high-scoring matching segments in unrelated protein sequences. To
avoid this problem, BLAST implements a filter procedure based on the SEG [49]
software. SEG provides a measure of the compositional complexity of a sequence
segment and divides sequences into contrasting segments of low complexity and
high complexity. Typically, globular domains have higher sequence complexity
than fibrillar or conformationally disordered protein segments. When used in
BLAST, SEG replaces the low complexity regions within the input sequence with
X's to prevent spurious matching with unrelated sequences.
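The idea of masking low complexity regions can be illustrated with a simplified sliding-window filter. The sketch below is not the SEG algorithm, whose actual procedure is more elaborate. It only uses the Shannon entropy of the residue composition in a fixed window as a stand-in complexity measure. The function names, the window length of 12, and the threshold of 2.2 bits are illustrative assumptions, as is the example sequence.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Compositional entropy (bits per residue) of a sequence segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


def mask_low_complexity(sequence, window=12, threshold=2.2):
    """Replace with 'X' every residue covered by at least one window whose
    entropy is below `threshold`. Window and threshold are illustrative
    values chosen for this sketch, not tuned parameters."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        # Entropy is always measured on the original, unmasked sequence.
        if shannon_entropy(sequence[start:start + window]) < threshold:
            masked[start:start + window] = "X" * window
    return "".join(masked)


# Hypothetical query: a poly-serine stretch flanked by diverse segments.
diverse = "ACDEFGHIKLMNPQRSTVWY"
query = diverse + "S" * 15 + diverse
print(mask_low_complexity(query))
```

Because a window is masked as a whole, a few residues flanking the biased stretch are masked as well. A real filter would refine the segment boundaries rather than rely on a fixed window.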
Once the set of similar sequences has been selected, the second step consists
of building an MSA. Unlike pairwise sequence alignment, building an optimal
multiple alignment is a difficult task that cannot be computed in reasonable
time. Several software implementations of heuristic algorithms for MSA