The Structural Network Properties of Biological Systems - Biological Networks

Information Technology Reference

In-Depth Information

sequences, that while being a very intuitive measure lacks a lot of information

concerning the evolutionary processes that shaped a gene (protein) family.

2.5.1. Computational approach

In the last years great improvements in phylogenetic methods has made it possible

to run very complex algorithms to understand the relationships existing within

a protein family. However, these methods could not use the enormous number

of proteins today deposited in the databases mainly because they need multiple

alignments, whose phylogenetic signal becomes deteriorated when very distant or

partial sequences are added to the analysis. On the contrary methods based on

relatively simpler algorithms, such as the local alignment tools Blast and Fasta, are

able to manage such large numbers of sequences. These, combined with network

theory and clustering algorithms, allow describing the structure of the evolutionary

network without the need to make multiple alignments and to manually rene them.

Moreover, when comparing very large numbers of sequences in this way it is also

possible to have both related and unrelated sequences within the same dataset and

these will be classied at the clustering stage. In the simplest example, the output

of a similarity search may be recoded as an adjacency matrix where each element

a ij is a measure of the similarity shared by proteins i and j and can be used as

input for a clustering algorithm to infer communities of related sequences. This

strategy has been used to construct the Clusters of Orthologous Groups (COG,

[65]) and OrthoMCL databases [35]. Advanced clustering techniques have been

implemented to better identify clusters of related proteins in large scale similarity

searches; tribeMCL [36] and orthoMCL [35] both implement a relatively recent

procedure, the Markov Clustering algorithm (Section 2.3.2) to partition proteins

into families.

The a ij element of the matrix used for clustering can be the normalized score

for the local alignment between sequence i and j; raw blast scores are not suitable

because they are the sum of substitution and gap scores taken from a given substi-

tution matrix (e.g. Blosum, Dayho) and they are consequently dependent on the

alignment length; this causes the risk of over-scoring long alignments; for this reason

it is recommended to normalize the matrix row by row on the basis of the diagonal

element s ii = s i max (the score of a self aligned protein is the maximum attainable

for each row); in this way each element becomes s norm

ij

= s ij

s i max . Alternatively,

the score can be transformed in bit form, which is normalized with respect to the

scoring system and thus can be used to compare alignment scores from dierent

searches; for Blast, S norm

bit

= S ln(K)

ln(2) , where and K are parameters of the Blast

run; bit scores subsume the statistical essence of the scoring system employed, so

that to calculate signicance one needs to know in addition only the size of the

search space.

After this transformation, the matrix encodes a network of normalized global

sequence relationships: nodes are proteins with edges between homologous proteins,

Biological Networks

Search WWH ::

Custom Search

Home