Biology Reference
In-Depth Information
use edit distances, breakpoints, rearrangements, recombination,
comparative mapping, and gene order have been introduced
[ 21 - 30 ]. While these methods do not require multiple alignments
they still tend to be computationally expensive.
The relative complexity measure utilized in GRAMALIGN is an
alignment free distance measure which is also computationally
efficient. The use of the relative complexity measure for computing
phylogenetic distance was demonstrated via successful construction
of phylogenies for eutharians using complete mitochondrial gen-
omes [ 31 ]. This measure was further validated [ 32 ] by using it for
phylogenetic analysis of medically relevant fungal species using the
cytochrome b gene, the 18S rDNA gene and the ITS region. The
topology constructed using this measure was robust to random
removal of significant portions (up to 40 %) of the genes. We
have made use of the computational efficiency of this distance
metric in both GRAMALIGN and to develop a clustering algorithm
that was validated by clustering large 16S data sets [ 33 , 34 ]. GRA-
MALIGN itself has been validated and incorporated into the primer
design pipeline UniPrime2 [ 35 ]. Finally, this measure was used to
cluster protein families into functional subtypes [ 36 ]. In all these
applications the distance measure is used to obtain the phylogenetic
topology and thus what has been important has been the ordering
of the distances between sequences.
The relative complexity measure of distance is based on an approxi-
mation to Kolmogorov complexity [ 37 ] by Lempel-Ziv complexity
[ 38 ]. The Lempel-Ziv (LZ) complexity of a sequence is the num-
ber of distinct phrases in the left-to-right parsing of a sequence.
A distinct phrase is one which has not occurred in the history of a
sequence. Consider the sequence Q
2.2 Relative
Complexity Measure
gagacagt . Initially the his-
tory of the sequence is the empty string. As g is not in the empty
string g becomes the first distinct phrase. The history of the
sequence now consists of the single letter g .As a is not in
the history, a becomes the second unique phrase. The history of
the sequence is now ga . The next two letters are g and a . As the
sequence ga exists in the history we keep building the phrase until
we get to gac which is not in the history of the sequence. Thus, gac
is the third unique phrase. The history is now the sequence gagac .
The next unique phrase in the sequence is agt , ag being available in
the history. Thus the sequence gagacagt can be represented by four
unique phrases and therefore has an LZ complexity of four. We can
represent the parsing by
¼
agt
and the LZ complexity by c ( Q ). We can see that sequences that have
more diversity will result in higher LZ complexity and sequences
that are more uniform will result in smaller LZ complexity. If we
concatenated this string with another similar to it, the increase in
Q
¼
g
a
gac
Search WWH ::




Custom Search