r = \frac{\sum_{i=1}^{n} (A_i - \bar{A})(B_i - \bar{B})}{\sqrt{\sum_{i=1}^{n} (A_i - \bar{A})^2 \, \sum_{i=1}^{n} (B_i - \bar{B})^2}}    [12]
The correlation coefficient is 1 for identical vectors, around zero for dissimilar
vectors, and -1 for anticorrelated vectors (A_i = -B_i). A large number of related coefficients
are given in [32, 33].
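The limiting values quoted above can be checked numerically. The following is a minimal sketch, assuming the coefficient is the usual Pearson product-moment correlation; the function name pearson_r and the example vector are illustrative and do not come from the text.

```python
import math

def pearson_r(a, b):
    """Pearson correlation between two equal-length numeric vectors (illustrative helper)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    # Numerator: sum of products of deviations from the respective means.
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    # Denominator: square root of the product of the summed squared deviations.
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_a * var_b)

a = [1.0, 2.0, 3.0, 4.0]
r_same = pearson_r(a, a)                # identical vectors
r_anti = pearson_r(a, [-x for x in a])  # anticorrelated vectors, B_i = -A_i
```

Here r_same evaluates to 1 and r_anti to -1 (up to floating-point rounding), matching the two extreme cases described in the text.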
2.8 Probability based measures
The final class of coefficients identified by Sneath and Sokal [25], probability-based
coefficients, takes account of the frequency distribution of variables over the entire data set.
Probability-based coefficients are less often used for small molecules, but
they are the most frequently used method of scoring in biological sequence comparison.
Probability-based measures are obtained by first calculating a raw proximity measure PM
between a query and all members of a dataset. The raw PM is then rescaled
using knowledge of the distribution of scores. This operation places the PM values on a
common scale and thus provides an obvious way to set a significance threshold for the hits of
interest. It is customary to distinguish “biologically meaningful” and “random” similarities.
The former are those between evolutionarily related proteins (homologs, orthologs, paralogs) or structurally
related proteins (molecules with a common fold); all other similarities are usually
considered “random”.
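As an illustration of placing raw scores on a common scale, the sketch below rescales a list of raw PM values into Z-scores relative to the scores observed across the dataset. This is only one possible rescaling, chosen here for simplicity; the text does not prescribe it, and the function name z_scores is hypothetical.

```python
import statistics

def z_scores(raw_scores):
    """Rescale raw proximity scores to Z-scores (assumed rescaling, for illustration only)."""
    mu = statistics.mean(raw_scores)
    sigma = statistics.pstdev(raw_scores)  # population standard deviation of the scores
    if sigma == 0:
        raise ValueError("all scores are identical; Z-scores are undefined")
    return [(s - mu) / sigma for s in raw_scores]

# Raw scores of one query against three dataset members (made-up numbers).
scaled = z_scores([10.0, 20.0, 30.0])
```

After rescaling, a single threshold (for example, a fixed number of standard deviations above the mean) can be applied regardless of the units of the original PM.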
One approach is based on the distribution of random similarities. If the distribution
is known in analytical or numeric form, then the statistical significance of any computed
measure, i.e. the probability P (0 ≤ P ≤ 1) of finding a given value in the given dataset by
chance, can be estimated. Random similarities are more likely to occur for larger queries and for
larger databases, so the description of random distributions usually includes query size and
database size as variables. (The product query size × database size is sometimes referred to
as the search space.) Current biological databases provide a sufficiently large amount of
data for modeling the distribution of random similarities, and, at least for sequence data,
various random shuffling techniques can be used to generate larger datasets. This approach
thus consists in rescaling a proximity measure PM to give a probability P for a given search
space. This P is called the statistical significance. In other words, if the value of the proximity
measure lies far outside the distribution of random scores (P is very small), one tends to
consider it biologically significant; conversely, large P values indicate random
similarities that are unimportant in the biological sense.
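The shuffling idea can be sketched as follows: the target sequence is shuffled many times, the proximity measure is recomputed for each shuffled copy, and P is estimated as the fraction of random scores that reach the observed one. This is a minimal sketch under stated assumptions: identity_score is a toy stand-in for a real proximity measure, and the names empirical_p, n_shuffles and seed are hypothetical. It does not model the dependence on database size.

```python
import random

def identity_score(s, t):
    """Toy proximity measure: number of positions at which the two sequences match."""
    return sum(1 for x, y in zip(s, t) if x == y)

def empirical_p(query, target, score=identity_score, n_shuffles=1000, seed=0):
    """Estimate P as the fraction of shuffled targets scoring at least the observed value."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    observed = score(query, target)
    letters = list(target)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # preserves composition, destroys order
        if score(query, letters) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)

p_related = empirical_p("ACDEFGHIKLMNPQRSTVWY", "ACDEFGHIKLMNPQRSTVWY")
p_trivial = empirical_p("AAAA", "AAAA", n_shuffles=50)
```

In this toy setting p_related is very small, because shuffled copies almost never reproduce all twenty matches, whereas p_trivial equals 1, because every shuffle of a single-letter sequence scores the same as the original.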
Another approach relies on the distribution of the target similarities, i.e. the
distribution of PM within a biologically important group of objects. Often there are not
enough reliable data for the analytical modelling of this target distribution, and random
shuffling techniques may not be as easily applicable as they are for random similarities. A
compromise solution consists in concentrating on the distribution of biologically significant
as well as random similarities in the neighbourhood of a target group [34, 35]. This
approach relies on the fact that the space defined by existing macromolecules is sparsely and
unevenly populated (as compared to the hypothetical space of all possible molecules), and
the neighbourhoods of existing similarity groups may be quite different.
Further kinds of probabilistic coefficients can be obtained if one represents the
objects themselves by some kind of distribution, and then compares the two distributions so
as to obtain a probabilistic estimate of their identity [36]. There are established methods for