Biomedical Engineering Reference
In-Depth Information
PROTOCOL 9.2 Computing a Phylogenetic Profile for a Gene b
Requirements
Internet access to download data and software
Programming or scripting language such as C, C++, Perl, or Matlab.
Method
1 Decide which genomes to use for the profile. See Sun et al. [60] for guidelines on
choosing appropriate genomes.
2 Construct a vector with each column representing a selected genome and initialize the
value of each column to zero.
3 Given the protein sequence of the gene of interest, perform a sequence similarity
search using BLAST or PSI-BLAST [58] against protein sequences from the selected
genomes. Protein sequences can be obtained from SwissProt
(http://www.ebi.ac.uk/swissprot/).
4 Retrieve all matching proteins below the E-value threshold as homologues.
5 In Pellegrini et al. [4], the P-value threshold is computed as $1/(nm)$, where n is the
number of proteins in the genome from which the sequence is taken and m is the
number of proteins from other genomes. The P-value is defined as $P = 1 - e^{-E}$, where E is
the BLAST E-value. c
6 For each genome in which a homologue is found, the corresponding column in the
vector is assigned a value of one.
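Steps 2, 4 and 6 above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in Pellegrini et al. [4]: the function name build_profile, the genome labels, the (genome, E-value) hit format and the threshold of 1e-5 are all assumptions made for the example, and the hits are presumed to have been parsed from BLAST output beforehand.

```python
from typing import Iterable, List, Sequence, Tuple


def build_profile(
    genomes: Sequence[str],
    hits: Iterable[Tuple[str, float]],
    e_threshold: float,
) -> List[int]:
    """Return a 0/1 vector with one column per genome, in the order given.

    `hits` is a hypothetical, already-parsed list of (genome, E-value) pairs
    for the query protein; it is not a real BLAST output format.
    """
    column = {name: i for i, name in enumerate(genomes)}
    profile = [0] * len(genomes)  # step 2: every column starts at zero
    for genome, e_value in hits:
        # steps 4 and 6: a hit below the threshold marks its genome with one
        if genome in column and e_value < e_threshold:
            profile[column[genome]] = 1
    return profile


# Illustrative data only: labels, E-values and threshold are made up.
genomes = ["E_coli", "B_subtilis", "S_cerevisiae", "H_sapiens"]
hits = [("E_coli", 1e-40), ("S_cerevisiae", 2e-8), ("H_sapiens", 3.0)]
profile = build_profile(genomes, hits, e_threshold=1e-5)
print(profile)
```

Keeping the genome order fixed across all genes is what makes the resulting vectors comparable column by column.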
Notes
b This is the original approach described in Pellegrini et al. [4]. A simpler method to construct
phylogenetic profiles using the COG database is described in Natale et al. [65].
c More details on the P-value and E-value, as well as other statistical measures used in BLAST,
can be found at http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html.
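The P-value test in step 5 can be written out as a short sketch. It uses the standard BLAST relation P = 1 - exp(-E) and the 1/(nm) threshold attributed above to Pellegrini et al. [4]. The function names and the protein counts in the example are hypothetical, chosen only for illustration.

```python
import math


def e_to_p(e_value: float) -> float:
    """Convert a BLAST E-value to a P-value, P = 1 - exp(-E).

    expm1 keeps precision when E is very small, where 1 - exp(-E)
    would otherwise round to zero.
    """
    return -math.expm1(-e_value)


def p_threshold(n: int, m: int) -> float:
    """Threshold 1/(n*m): n proteins in the query genome, m in the others."""
    return 1.0 / (n * m)


def is_homologue(e_value: float, n: int, m: int) -> bool:
    """A hit counts as a homologue if its P-value is below the threshold."""
    return e_to_p(e_value) < p_threshold(n, m)


# Hypothetical sizes: 4,000 proteins in the query genome, 100,000 elsewhere.
print(p_threshold(4000, 100000))
print(is_homologue(1e-12, 4000, 100000))
print(is_homologue(1e-3, 4000, 100000))
```

For small E the P-value and the E-value are nearly identical, so in practice the 1/(nm) cutoff behaves like a very strict E-value cutoff.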
Comparing genes based on evolutionary trees better reflects their evolutionary similarities,
but is conceptually more complex and computationally more expensive. An appealing
alternative that avoids this complexity, yet can achieve similar sensitivity, is the
heuristic approach proposed in [66]. It accounts for the underlying phylogeny of the
profiles by first ordering the reference organisms by evolutionary distance. Genes that
truly co-evolve are likely to be conserved in distant organisms; hence, their matching
organisms are less likely to occur in consecutive runs in the sorted profiles. A
hypergeometric approach is used to compute the probability that random gene pairs achieve
as many matching runs as observed between a pair of genes, or fewer. The larger the number
of observed runs, the more likely the genes are to have co-evolved and, hence, to share
functions.
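To make the idea of matching runs concrete, the sketch below counts runs of consecutive shared genomes in two ordered profiles and evaluates the probability of seeing that many runs or fewer. It does not reproduce the calculation in [66]. It assumes a simple null model in which the k shared genomes are placed uniformly at random among the n ordered genomes, so that exactly r runs arise in C(k-1, r-1) * C(n-k+1, r) of the C(n, k) placements. The function names and example profiles are hypothetical.

```python
from math import comb
from typing import Sequence


def count_matching_runs(a: Sequence[int], b: Sequence[int]) -> int:
    """Count maximal runs of consecutive genomes present in both profiles.

    Both profiles must list genomes in the same evolutionary-distance order.
    """
    runs = 0
    previous = 0
    for x, y in zip(a, b):
        match = 1 if (x and y) else 0
        if match and not previous:  # a new run starts here
            runs += 1
        previous = match
    return runs


def prob_runs_at_most(n: int, k: int, r_obs: int) -> float:
    """P(number of runs <= r_obs) when k matches are placed uniformly
    at random among n ordered genomes (assumed null model, see lead-in)."""
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    if k == 0:
        return 1.0  # no matches means zero runs in every arrangement
    favourable = sum(
        comb(k - 1, r - 1) * comb(n - k + 1, r)
        for r in range(1, min(r_obs, k) + 1)
    )
    return favourable / comb(n, k)


# Illustrative profiles over eight genomes sorted by evolutionary distance.
gene_a = [1, 1, 0, 1, 0, 1, 1, 0]
gene_b = [1, 1, 0, 1, 0, 0, 1, 0]
k = sum(1 for x, y in zip(gene_a, gene_b) if x and y)
r = count_matching_runs(gene_a, gene_b)
print(k, r, prob_runs_at_most(len(gene_a), k, r))
```

Under this null model, matches confined to a single block of closely related genomes give a small probability of as few runs or fewer, while matches scattered across distant genomes push it towards one.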