Comparison of sequences, protein 3D
structures and genomes
László KAJÁN 1, Kristian VLAHOVICEK 1,2, Oliviero CARUGO 1,3, Vilmos ÁGOSTON 4,
Zoltán HEGEDÜS 4 and Sándor PONGOR 1
1 Protein Structure and Bioinformatics Group, International Centre for Genetic Engineering
and Biotechnology, Area Science Park, 34012 Trieste, Italy
2 Molecular Biology Department, Biology Division, Faculty of Science, University of
Zagreb, 10000 Zagreb, Croatia
3 Department of General Chemistry, Pavia University, viale Taramelli 12, 27100 Pavia,
Italy
4 Bioinformatics Group, Biological Research Center, Hungarian Academy of Sciences,
Temesvári krt. 62, 6726 Szeged, Hungary
Abstract. The analysis of similarity is a fundamental task in comparing sequences,
three-dimensional structures, genomes and molecular networks. This
chapter reviews the common principles underlying these diverse applications.
Introduction
The basic concepts of similarity analysis - as presented in the first part of this review -
provide a common framework for the classification of a newly identified protein
sequence or protein 3-D structure. Classification of an object implies placing it into one of the
already existing categories or marking it as “unknown”, i.e. as a potential initiator of a new
category. This process usually consists of the following steps.
Recognition of similarity. This is a qualitative decision that is often based on some
approximate quantitative measure. In sequence analysis, if the raw alignment score is above
a threshold, the similarity is considered significant and retained for further analysis. In the
case of protein 3-D structures, the preliminary evaluation is often based on visual
inspection.
Next, the basis of similarity, i.e. a common substructure, is identified. This is carried
out by matching the equivalent entities and relationships; sequence alignments and
structural alignments are the best examples. Determination of the matching by
computer involves maximization of a similarity measure (or minimization of a distance
measure), and the final value of that measure is used as a numeric measure of
similarity.
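As an illustration of a matching obtained by maximizing a similarity measure, the sketch below computes a raw local alignment score by dynamic programming and compares it with a threshold. It is a minimal example, not the method of any particular program: the function names, the match/mismatch/gap values and the threshold are assumptions chosen for illustration, and a real protein comparison would use a substitution matrix and a statistical estimate of significance.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (Smith-Waterman recurrence
    with a linear gap penalty). Scoring values are illustrative only."""
    prev = [0] * (len(b) + 1)   # previous row of the score matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the diagonal with a match or a mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Maximize over: restart (0), substitution, gap in b, gap in a.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def is_significant(score, threshold):
    """Retain a similarity only if the raw score is above the threshold."""
    return score > threshold


# Hypothetical sequences and threshold, for illustration only.
score = local_alignment_score("GGACGTCC", "TTACGTAA")
retained = is_significant(score, threshold=5)
```

The maximum found by the recurrence plays the role of the numeric measure of similarity; recovering the alignment itself would additionally require a traceback through the score matrix.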
Evaluation of similarity. First, a decision has to be made whether or not the
similarity is biologically important, and the protein is either assigned to a known similarity
group or considered the initiator of a new group. This decision is usually based
on one or more similarity scores as well as on the alignment, but human judgment is hard to
replace at this stage.
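The automatable part of this decision can be sketched as follows: given the similarity scores of a query against representatives of the known groups, the query is assigned to the best-scoring group if that score is above a threshold, and is otherwise marked as unknown. The function name, group labels, scores and threshold are hypothetical, and such a rule can only propose an assignment for later inspection.

```python
def assign_group(scores_by_group, threshold):
    """Return the best-scoring known group, or None ("unknown") when no
    score is above the threshold. All names and values are illustrative."""
    if not scores_by_group:
        return None  # nothing to compare against: potential new group
    best_group = max(scores_by_group, key=scores_by_group.get)
    if scores_by_group[best_group] > threshold:
        return best_group
    return None  # potential initiator of a new group


# Hypothetical scores of one query against two known groups.
proposal = assign_group({"group_A": 120, "group_B": 35}, threshold=50)
```

A result of `None` corresponds to the "unknown" case, i.e. a candidate for a new category.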
Representation of similarity in databases. Once the similarity is established, it has
to be added to the annotation of the protein in the sequence and/or 3-D databases. Protein
superfamilies, structural domains, orthologous groups etc. are determined by similarity
analysis, and there is a large number of secondary databases that are dedicated to the curation