Statistics of Alignment
Given that much of the day-to-day statistical work in bioinformatics involves tools that apply
statistical principles to explore nucleotide and protein sequences, a review of some of the principles
behind the statistics of alignment is in order. Because good alignments of nucleotide sequences
can occur by chance alone, statistical methods, often combined with heuristics, are used to help
determine the significance of an alignment. For example, the BLAST algorithm computes the
expected frequency of matching sequences that should occur in an alignment search in order to
conduct a more efficient search.
In calculating an alignment score ( S ), the underlying question is usually "is the alignment score high
enough to suggest homology?" The first part of the answer is to determine how high a score could
occur by chance alone. The challenge is that no mathematical theory adequately describes the
statistics of the scores that can be expected for global alignments. In lieu of an underlying
mathematical basis for computing the significance of global alignments, ad hoc methods have been
devised that compare the observed alignment score with the scores obtained by aligning random
sequences of the same length and composition as those under study.
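The comparison against random sequences described above can be sketched as a shuffle test. This is a minimal illustration, not a method taken from any particular tool: the scoring values (match = 1, mismatch = -1, gap = -2), the number of trials, and the function names are all assumptions chosen for the example. Shuffling one sequence preserves both its length and its composition.

```python
import random

def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions, not recommended settings.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def shuffled_scores(a, b, trials=200, seed=0):
    """Score `a` against shuffled copies of `b` (same length and composition)."""
    rng = random.Random(seed)
    letters = list(b)
    scores = []
    for _ in range(trials):
        rng.shuffle(letters)
        scores.append(global_score(a, "".join(letters)))
    return scores

def empirical_p(observed, scores):
    """Fraction of random scores at least as high as the observed score.

    The +1 terms keep the estimate from ever being exactly zero.
    """
    hits = sum(1 for s in scores if s >= observed)
    return (1 + hits) / (1 + len(scores))
```

A typical use would be to compute `global_score(a, b)` for the real pair, then pass it with `shuffled_scores(a, b)` to `empirical_p`. A small value suggests the observed score is unlikely to arise from composition alone, although the estimate is only as fine-grained as the number of trials allows.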
The situation is different for local alignment, because the extreme value distribution adequately
describes the expected distribution of random local alignment scores. By relating the observed score
to this expected distribution, the statistical significance of an alignment can be assessed.
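As a sketch of how an observed local alignment score can be related to the extreme value distribution, the commonly cited Karlin-Altschul form gives the expected number of chance alignments scoring at least S as E = K·m·n·e^(-λS), and the probability of at least one such alignment as P = 1 - e^(-E). Here m and n are the lengths of the two sequences (or the query and the database). The parameters λ and K depend on the scoring system and residue frequencies; the values used in the usage note below are placeholders for illustration only.

```python
import math

def expected_hits(score, m, n, lam, k):
    """Expected number of chance local alignments scoring >= `score`.

    `lam` and `k` must be estimated for the scoring system in use;
    nothing in this sketch derives them.
    """
    return k * m * n * math.exp(-lam * score)

def p_value(score, m, n, lam, k):
    """Probability of at least one chance alignment scoring >= `score`."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate when E is tiny.
    return -math.expm1(-expected_hits(score, m, n, lam, k))
```

For example, `p_value(60, m=300, n=1_000_000, lam=0.3, k=0.1)` evaluates the tail probability for a hypothetical search with those placeholder parameters. For small E the two quantities are nearly equal, which is why the expected count is often read directly as a significance measure.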
A statistic commonly used in alignment searches is the z-score, which is a measure of the distance
from the mean, measured in standard deviation units. If each sequence to be aligned is randomized
and an optimal alignment is made, the result is a series of scores for the alignment of two
sequences, with a mean ( $\mu$ ) and standard deviation ( $\sigma$ ). In this scenario, the z-score ( z )
of an alignment with score S is computed as:

$$ z = \frac{S - \mu}{\sigma} $$
The advantage of a z-score over a simple percentage score is that it corrects for compositional biases
in the sequence and accounts for the varying length of sequences. The problem with using a z-score
to assess whether an alignment occurred by chance is that a z-score assumes a normal distribution,
and alignment scores do not follow a normal distribution. As a result, a higher z-score than a normal
distribution would suggest should be taken as the threshold of significance.
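The z-score calculation can be sketched in a few lines, assuming the scores of the randomized alignments have already been collected in a list. The function name and the choice of the sample standard deviation are assumptions made for this illustration.

```python
import statistics

def z_score(observed, random_scores):
    """Distance of `observed` from the mean of `random_scores`,
    in units of their (sample) standard deviation."""
    mean = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)  # needs at least two scores
    if sd == 0:
        raise ValueError("random scores have zero spread; z-score is undefined")
    return (observed - mean) / sd
```

For instance, `z_score(10, [2, 4, 6])` gives 3.0, because the random scores have mean 4 and sample standard deviation 2. In line with the caution above, such a value should not be converted to a probability using normal tables.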
Different distributions serve different purposes in statistical work in bioinformatics. Binomial
distributions are used for spotting stretches of DNA with unusual nucleotide sequences and for
pairwise sequence comparisons.
Normal distributions are used for modeling continuous random variables, with applications such as
the statistical significance of pairwise sequence comparison. Multinomial distributions are used for
spotting stretches of DNA with unusual content, distinguishing tests for introns by composition, and
quantifying relative codon frequency.
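As one illustration of the binomial case, the sketch below asks how likely it is for a window of DNA to contain at least as many G or C bases as observed, if each base were independently G or C with a fixed background probability. The independence assumption, the default background of 0.5, and the function names are assumptions made for the example.

```python
from math import comb

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def gc_window_p(window, p_gc=0.5):
    """Chance of seeing at least this many G/C bases in a window of this
    length, under an independent-bases background model (an assumption)."""
    seq = window.upper()
    gc = sum(1 for base in seq if base in "GC")
    return binomial_tail(gc, len(seq), p_gc)
```

A very small value from `gc_window_p` flags the window as unusually G/C-rich relative to the assumed background. When many windows are scanned, the threshold would need adjusting for multiple comparisons.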
Relying solely on mathematical methods for statistical analysis, without incorporating heuristics
or knowledge of the underlying biology, can often lead to incorrect conclusions. For example, a run of
pure C-G sequences in a sequence to be aligned will likely match many C-G-rich regions in a
sequence database. Based on this knowledge, masks can be used to hide these regions so that the
search algorithm ignores them during the database search.
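A minimal sketch of the masking idea, assuming uppercase DNA and a deliberately simple rule: any run of consecutive C and G bases at least `min_run` long is replaced with a mask character. The rule, the threshold, and the function name are hypothetical. Filters used in practice rely on more general measures of sequence complexity than this.

```python
import re

def mask_cg_runs(seq, min_run=10, mask_char="N"):
    """Replace each run of C/G bases of length >= `min_run` with `mask_char`.

    The masked sequence keeps its original length, so coordinates in
    any later alignment still refer to the unmasked sequence.
    """
    pattern = re.compile(r"[CG]{%d,}" % min_run)
    return pattern.sub(lambda m: mask_char * len(m.group()), seq)
```

For example, `mask_cg_runs("ATCGCGCGAT", min_run=4)` returns `"ATNNNNNNAT"`, while a sequence with no sufficiently long C/G run is returned unchanged.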
 
 