Databases Reference
In-Depth Information
quality of the classification significantly; and (3) model-based classification such as using
hidden Markov model (HMM) or other statistical models to classify sequences.
For time-series or other numeric-valued data, the feature selection techniques for
symbolic sequences cannot be easily applied to time-series data without discretization.
However, discretization can cause information loss. A recently proposed time-series
shapelets method uses the time-series subsequences that can maximally represent a class
as the features. It achieves quality classification results.
Alignment of Biological Sequences
Biological sequences generally refer to sequences of nucleotides or amino acids. Biolog-
ical sequence analysis compares, aligns, indexes, and analyzes biological sequences and
thus plays a crucial role in bioinformatics and modern biology.
Sequence alignment is based on the fact that all living organisms are related by evo-
lution. This implies that the nucleotide (DNA, RNA) and protein sequences of species
that are closer to each other in evolution should exhibit more similarities. An alignment
is the process of lining up sequences to achieve a maximal identity level, which also
expresses the degree of similarity between sequences. Two sequences are homologous
if they share a common ancestor. The degree of similarity obtained by sequence align-
ment can be useful in determining the possibility of homology between two sequences.
Such an alignment also helps determine the relative positions of multiple species in an
evolution tree, which is called a phylogenetic tree .
The problem of alignment of biological sequences can be described as follows: Given
two or more input biological sequences, identify similar sequences with long conserved sub-
sequences . If the number of sequences to be aligned is exactly two, the problem is known
as pairwise sequence alignment ; otherwise, it is multiple sequence alignment . The
sequences to be compared and aligned can be either nucleotides (DNA/RNA) or amino
acids (proteins). For nucleotides, two symbols align if they are identical. However, for
amino acids, two symbols align if they are identical, or if one can be derived from the
other by substitutions that are likely to occur in nature. There are two kinds of align-
ments: local alignments and global alignments . The former means that only portions of
the sequences are aligned, whereas the latter requires alignment over the entire length of
the sequences.
For either nucleotides or amino acids, insertions, deletions, and substitutions occur
in nature with different probabilities. Substitution matrices are used to represent the
probabilities of substitutions of nucleotides or amino acids and probabilities of inser-
tions and deletions. Usually, we use the gap character, , to indicate positions where
it is preferable not to align two symbols. To evaluate the quality of alignments, a scor-
ing mechanism is typically defined, which usually counts identical or similar symbols as
positive scores and gaps as negative ones. The algebraic sum of the scores is taken as the
alignment measure. The goal of alignment is to achieve the maximal score among all the
possible alignments. However, it is very expensive (more exactly, an NP-hard problem)
to find optimal alignment. Therefore, various heuristic methods have been developed to
find suboptimal alignments.
 
Search WWH ::




Custom Search