Statistics - Bioinformatics Computing

Statistics of Alignment

Given that much of the day-to-day statistical work in bioinformatics involves using tools that utilize

statistical principles to explore nucleotide and protein sequences, a review of some of the principles

related to the statistics of alignment is in order. Because good alignment of nucleotide sequences

can occur by chance alone, statistical methods, often combined with heuristics, are used to help

determine the significance of an alignment. For example, the BLAST algorithm computes the

expected frequency of matching sequences that should occur in an alignment search in order to

conduct a more efficient search.

In calculating an alignment score ( S ), the underlying question is usually "is the alignment score high

enough to suggest homology?" The first part of the answer is to determine how high a score could

occur by chance alone. However, the challenge here is that no mathematical theory adequately describes

statistics of the scores that can be expected for global alignments. In lieu of an underlying

mathematical basis for computing the significance of global alignments, ad-hoc methods have been

devised for comparing alignment scores with scores of random sequences that seem to align, using

sequences the same length and composition as those under study.
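
The shuffle-based comparison described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the scoring values (match, mismatch, gap), the number of trials, and the function names `global_score`, `shuffled_scores`, and `empirical_p_value` are hypothetical choices, not taken from the text or from any particular tool. Shuffling preserves both the length and the composition of each sequence.

```python
import random

def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty.
    The scoring values are illustrative assumptions."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def shuffled_scores(a, b, trials=200, seed=0):
    """Score alignments of shuffled copies of a and b.
    Shuffling keeps the length and composition of each sequence."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        sa, sb = list(a), list(b)
        rng.shuffle(sa)
        rng.shuffle(sb)
        scores.append(global_score("".join(sa), "".join(sb)))
    return scores

def empirical_p_value(observed, null_scores):
    """Fraction of random scores at least as high as the observed score,
    with a +1 correction so the estimate is never exactly zero."""
    hits = sum(1 for s in null_scores if s >= observed)
    return (hits + 1) / (len(null_scores) + 1)
```

A typical use would be to compute `global_score(a, b)` for the real sequences and pass it, together with `shuffled_scores(a, b)`, to `empirical_p_value`. The resolution of the estimate is limited by the number of trials.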

The situation is different for local alignment, because the extreme value distribution adequately describes

the expected distribution of random local alignment scores. By relating the observed direct score to

the expected distribution, the statistical significance of alignment can be assessed.
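
As a rough illustration of relating an observed score to an extreme value distribution, the sketch below evaluates the upper tail of a Gumbel (type I extreme value) distribution. The `location` and `scale` parameters are assumptions: in practice they would have to be estimated, for example by fitting to scores from random alignments, and the function name `evd_p_value` is hypothetical.

```python
import math

def evd_p_value(score, location, scale):
    """P(S >= score) for a Gumbel distribution with the given
    location and scale (both assumed to be known or fitted elsewhere)."""
    z = (score - location) / scale
    # 1 - exp(-exp(-z)), written with expm1 to keep precision in the far tail
    return -math.expm1(-math.exp(-z))
```

A score equal to the location parameter has a tail probability of about 0.63, and the probability falls off roughly exponentially as the score rises above it.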

A statistic commonly used in alignment searches is the z-score, which is a measure of the distance

from the mean, measured in standard deviation units. If each sequence to be aligned is randomized

and an optimal alignment is made, the result is a series of scores ( S ) for the alignment of two

sequences, with a mean ( $\mu$ ) and standard deviation ( $\sigma$ ). In this scenario, the z-score ( z ) is

computed as:

$$z = \frac{S - \mu}{\sigma}$$

The advantage of a z-score over a simple percentage score is that it corrects for compositional biases

in the sequence and accounts for the varying length of sequences. The problem with using a z-score

to assess whether an alignment occurred by chance is that a z-score assumes a normal distribution.

However, alignment scores don't follow a normal distribution. As a result, a higher z-score than a normal

distribution would suggest should be taken as the threshold of significance.
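
The z-score described above (the observed score minus the mean of the randomized scores, divided by their standard deviation) might be computed as in the following sketch. The function name `z_score` is hypothetical, and the use of the sample standard deviation rather than the population standard deviation is an assumption; the text does not say which is intended.

```python
import statistics

def z_score(observed, random_scores):
    """Distance of the observed score from the mean of the randomized
    scores, in standard deviation units."""
    mean = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)  # sample standard deviation (assumption)
    if sd == 0:
        raise ValueError("randomized scores have zero spread")
    return (observed - mean) / sd
```

For example, `z_score(5, [1, 2, 3])` gives 3.0, because the randomized scores have a mean of 2 and a sample standard deviation of 1.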

Different distributions have different uses in statistical work in bioinformatics. Binomial distributions are used for

spotting stretches of DNA with unusual nucleotide sequences and pairwise sequence comparisons.

Normal distributions are used for modeling continuous random variables, with applications such as

the statistical significance of pairwise sequence comparison. Multinomial distributions are used for

spotting stretches of DNA with unusual content, distinguishing tests for introns by composition, and

quantifying relative codon frequency.
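
To make the binomial case concrete, the sketch below asks how likely it is to see at least a given number of G or C bases in a window by chance. It assumes that bases are independent and that the background G+C probability `p` is known; both are simplifying assumptions, and the function names are hypothetical.

```python
from math import comb

def gc_count(window):
    """Number of G or C bases in a window (case-insensitive)."""
    return sum(1 for base in window.upper() if base in "GC")

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))
```

Calling `binomial_tail(gc_count(window), len(window), p)` then gives the chance of a window at least that G+C-rich under the assumed background. A very small value flags the window as having unusual content.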

Relying solely on purely mathematical methods for statistical analysis without incorporating heuristics

or knowledge of the underlying biology can often lead to incorrect conclusions. For example, a run of

pure C-G in a sequence to be aligned will likely match many C-G-rich regions in a sequence

database, even though such matches say little about homology. Based on this knowledge, masks can

be used to hide these regions from the database search, so that the search algorithm ignores them.
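
A very simple form of masking can be sketched as follows. This is an illustration only, not how production low-complexity filters work: it replaces any run of C or G at least `min_run` bases long with a mask character. The function name `mask_gc_runs`, the default run length, and the choice of `N` as the mask character are all assumptions.

```python
import re

def mask_gc_runs(sequence, min_run=8, mask_char="N"):
    """Replace each run of C/G bases of length >= min_run with mask_char,
    keeping the sequence length unchanged."""
    pattern = re.compile(r"[CG]{%d,}" % min_run, re.IGNORECASE)
    return pattern.sub(lambda m: mask_char * len(m.group()), sequence)
```

For example, `mask_gc_runs("ATCGCGCGCGAT")` returns `"ATNNNNNNNNAT"`, while a sequence with no sufficiently long C/G run is returned unchanged.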
