Information Technology Reference
In-Depth Information
sequences, that while being a very intuitive measure lacks a lot of information
concerning the evolutionary processes that shaped a gene (protein) family.
2.5.1. Computational approach
In the last years great improvements in phylogenetic methods has made it possible
to run very complex algorithms to understand the relationships existing within
a protein family. However, these methods could not use the enormous number
of proteins today deposited in the databases mainly because they need multiple
alignments, whose phylogenetic signal becomes deteriorated when very distant or
partial sequences are added to the analysis. On the contrary methods based on
relatively simpler algorithms, such as the local alignment tools Blast and Fasta, are
able to manage such large numbers of sequences. These, combined with network
theory and clustering algorithms, allow describing the structure of the evolutionary
network without the need to make multiple alignments and to manually rene them.
Moreover, when comparing very large numbers of sequences in this way it is also
possible to have both related and unrelated sequences within the same dataset and
these will be classied at the clustering stage. In the simplest example, the output
of a similarity search may be recoded as an adjacency matrix where each element
a ij is a measure of the similarity shared by proteins i and j and can be used as
input for a clustering algorithm to infer communities of related sequences. This
strategy has been used to construct the Clusters of Orthologous Groups (COG,
[65]) and OrthoMCL databases [35]. Advanced clustering techniques have been
implemented to better identify clusters of related proteins in large scale similarity
searches; tribeMCL [36] and orthoMCL [35] both implement a relatively recent
procedure, the Markov Clustering algorithm (Section 2.3.2) to partition proteins
into families.
The a ij element of the matrix used for clustering can be the normalized score
for the local alignment between sequence i and j; raw blast scores are not suitable
because they are the sum of substitution and gap scores taken from a given substi-
tution matrix (e.g. Blosum, Dayho) and they are consequently dependent on the
alignment length; this causes the risk of over-scoring long alignments; for this reason
it is recommended to normalize the matrix row by row on the basis of the diagonal
element s ii = s i max (the score of a self aligned protein is the maximum attainable
for each row); in this way each element becomes s norm
ij
= s ij
s i max . Alternatively,
the score can be transformed in bit form, which is normalized with respect to the
scoring system and thus can be used to compare alignment scores from dierent
searches; for Blast, S norm
bit
= S ln(K)
ln(2) , where and K are parameters of the Blast
run; bit scores subsume the statistical essence of the scoring system employed, so
that to calculate signicance one needs to know in addition only the size of the
search space.
After this transformation, the matrix encodes a network of normalized global
sequence relationships: nodes are proteins with edges between homologous proteins,
Search WWH ::




Custom Search