Concepts of Similarity in Bioinformatics - Essays in Bioinformatics

Information Technology Reference

In-Depth Information

Distance measures calculated between identical molecular descriptions (vectors) are

zero, and may grow without limit for non-identical vectors. It is sometimes desirable to

have bounded values, for example the so-called Canberra metric is defined as:

[4]

so it is zero for identical values but remains below unity for non-identical vectors. The

metric properties of vector distance measures are important for clustering and for

evolutionary studies. For M to be a metric (metric distance), the following criteria have to

be fulfilled for all A, B, C from X: i)

(

)

the equality holding if and only if

A ;

ii)

M t (triangular

inequality). Metric properties are essential if a distance measure is to be used for clustering.

A string similarity measures S (eqn. 5) is applicable to clustering if there is an associated

distance measure

(

)

(

)

(symmetry);

iii)

(

)

(

)

(

)

M that has metric properties. f is a monotonous function, and

distance measures such as 1-kS (where k is a constant) are routinely used in clustering

applications.

( S

)

2.4 String similarity measures for biological sequences

A special class of proximity measures, sequence similarity scores are used to quantify the

matching (alignment) of protein and DNA sequences. The underlying mathematical concept

is the string distance. Let us first concentrate on bit strings consisting of zero/one values.

( Figure 6A ). The Hamming distance is the number of (zero to one or one to zero) changes

necessary to change string 1 to string 2. This can be used immediately for short character

strings of identical length, with the condition that only exchanges are possible, gaps are not

allowed. If we keep this condition, we can use a simple lookup table to store the costs of

exchanging one character against another. The situation is the same if we use overlapping

doublet or triplet words, etc. i.e. the Hamming distance is a simple distance that can be

unequivocally computed based on lookup tables, because the matching of the two strings is

considered unique.

The situation is quite different if we match strings of arbitrary length and allow gaps

(Figure 7). The string edit distance is defined as the minimal number of steps (insertions,

deletions and replacements) necessary to transform one word into the other. The proximity

measures used for biological sequences are defined as similarity coefficients (high values

for similar, low for dissimilar sequences) and contain cost factors for residue substitutions

as well as gaps (insertions, deletions).

cos

[5]

identities

tss

replacemen

gaps

Essays in Bioinformatics

Search WWH ::

Custom Search

Home