Data Mining Trends and Research Frontiers - Data Mining: Concepts and Techniques


quality of the classification significantly; and (3) model-based classification, such as using a hidden Markov model (HMM) or other statistical models to classify sequences.

For time-series or other numeric-valued data, the feature selection techniques for symbolic sequences cannot be easily applied without discretization. However, discretization can cause information loss. A recently proposed method, time-series shapelets, uses the time-series subsequences that can maximally represent a class as the features. It is reported to achieve high-quality classification results.
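To make the idea of a subsequence used as a feature more concrete, the sketch below computes the distance between a candidate shapelet and a time series as the smallest Euclidean distance over all windows of the same length. This follows a formulation commonly used in the shapelet literature; the text above does not spell out the computation, so the function name `subsequence_distance` and the example values are illustrative assumptions.

```python
import math


def subsequence_distance(series, shapelet):
    """Smallest Euclidean distance between `shapelet` and any
    same-length window of `series` (hypothetical helper)."""
    m = len(shapelet)
    if m == 0 or m > len(series):
        raise ValueError("shapelet must be non-empty and no longer than the series")
    best = math.inf
    # Slide a window of length m over the series and keep the closest match.
    for start in range(len(series) - m + 1):
        window = series[start:start + m]
        dist = math.sqrt(sum((w - s) ** 2 for w, s in zip(window, shapelet)))
        best = min(best, dist)
    return best


# Illustrative data (not from the text): a spike-shaped candidate shapelet.
spike = [0, 5, 0]
with_spike = [1, 1, 0, 5, 0, 1]
flat = [1, 1, 1, 1, 1, 1]

d_spike = subsequence_distance(with_spike, spike)  # the spike occurs exactly
d_flat = subsequence_distance(flat, spike)         # no window resembles the spike
```

A series would then be assigned to a class by comparing this distance with a threshold, so a good shapelet is one whose distance separates the classes well.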

Alignment of Biological Sequences

Biological sequences generally refer to sequences of nucleotides or amino acids. Biological sequence analysis compares, aligns, indexes, and analyzes biological sequences and thus plays a crucial role in bioinformatics and modern biology.

Sequence alignment is based on the fact that all living organisms are related by evolution. This implies that the nucleotide (DNA, RNA) and protein sequences of species that are closer to each other in evolution should exhibit more similarities. An alignment is the process of lining up sequences to achieve a maximal identity level, which also expresses the degree of similarity between sequences. Two sequences are homologous if they share a common ancestor. The degree of similarity obtained by sequence alignment can be useful in determining the possibility of homology between two sequences. Such an alignment also helps determine the relative positions of multiple species in an evolution tree, which is called a phylogenetic tree.

The problem of alignment of biological sequences can be described as follows: Given two or more input biological sequences, identify similar sequences with long conserved subsequences. If the number of sequences to be aligned is exactly two, the problem is known as pairwise sequence alignment; otherwise, it is multiple sequence alignment. The sequences to be compared and aligned can be either nucleotides (DNA/RNA) or amino acids (proteins). For nucleotides, two symbols align if they are identical. However, for amino acids, two symbols align if they are identical, or if one can be derived from the other by substitutions that are likely to occur in nature. There are two kinds of alignments: local alignments and global alignments. The former means that only portions of the sequences are aligned, whereas the latter requires alignment over the entire length of the sequences.
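The difference between global and local alignment can be illustrated for the pairwise case with two small dynamic-programming routines, in the style of the Needleman-Wunsch (global) and Smith-Waterman (local) algorithms. This is a minimal sketch that returns only the best score, not the alignment itself; the scoring values (match +1, mismatch -1, gap -1) and the function names are assumptions chosen for illustration.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score when both sequences are aligned over their entire length."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + s,    # align a[i-1] with b[j-1]
                            prev[j] + gap,      # gap in b
                            curr[j - 1] + gap)) # gap in a
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score over all pairs of portions (substrings) of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere, so only the
            # best-matching portions contribute to the score.
            cell = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Illustrative sequences sharing only the conserved core "ACGT".
x = "TTTACGTCCC"
y = "GGGACGTAAA"
g = global_score(x, y)  # penalized by the dissimilar flanks
l = local_score(x, y)   # rewards only the conserved core
```

With these sequences the local score reflects the shared core alone, while the global score is pulled down by the unrelated ends, which is why local alignment is the usual choice when only portions of the sequences are expected to be similar.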

For either nucleotides or amino acids, insertions, deletions, and substitutions occur in nature with different probabilities. Substitution matrices are used to represent the probabilities of substitutions of nucleotides or amino acids and probabilities of insertions and deletions. Usually, we use the gap character, "-", to indicate positions where it is preferable not to align two symbols. To evaluate the quality of alignments, a scoring mechanism is typically defined, which usually counts identical or similar symbols as positive scores and gaps as negative ones. The algebraic sum of the scores is taken as the alignment measure. The goal of alignment is to achieve the maximal score among all the possible alignments. However, finding an optimal alignment is expensive. An optimal pairwise alignment can be computed by dynamic programming in time proportional to the product of the two sequence lengths, but optimal multiple sequence alignment is, more exactly, an NP-hard problem under common scoring schemes. Therefore, various heuristic methods have been developed to find suboptimal alignments.
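The scoring mechanism described above can be sketched as a function that sums per-column scores of an already gapped pairwise alignment. The specific values (match +1, mismatch -1, gap -2), the use of "-" as the gap character, and the name `alignment_score` are assumptions for illustration; in practice the scores for substitutions would come from a substitution matrix.

```python
GAP = "-"


def alignment_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Algebraic sum of column scores for two equal-length gapped rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == GAP and y == GAP:
            raise ValueError("a column cannot consist of two gaps")
        if x == GAP or y == GAP:
            total += gap        # gaps count as negative scores
        elif x == y:
            total += match      # identical symbols count as positive scores
        else:
            total += mismatch
    return total


# Two candidate alignments of the same pair of sequences (illustrative data).
candidates = [
    ("GATTACA", "GA-TACA"),  # one gap, six identical columns
    ("GATTACA", "GATACA-"),  # trailing gap, three mismatched columns
]
scores = [alignment_score(a, b) for a, b in candidates]
best = max(zip(scores, candidates))  # the alignment with the maximal score
```

Here the first candidate scores 4 and the second scores -2, so the first is preferred. An alignment algorithm searches the space of all such candidates rather than a hand-picked list.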
