Biomedical Engineering Reference
PROTOCOL 9.2 Computing a Phylogenetic Profile for a Gene (note b)
Requirements
• Internet access to download data and software
• Programming or scripting language such as C, C++, Perl, or Matlab
Method
1 Decide which genomes to use for the profile. See Sun et al. [60] for guidelines on
choosing appropriate genomes.
2 Construct a vector with each column representing a selected genome and initialize the
value of each column to zero.
3 Given the protein sequence of the gene of interest, perform a sequence similarity
search using BLAST or PSI-BLAST [58] against protein sequences from the selected
genomes. Protein sequences can be obtained from SwissProt (http://www.ebi.ac.uk/swissprot/).
4 Retrieve all matching proteins below the E-value threshold as homologues.
5 In Pellegrini et al. [4], the P-value threshold is computed as $1/(nm)$, where $n$ is the
number of proteins in the genome from which the sequence is taken and $m$ is the
number of proteins from other genomes. The P-value is defined as $P = 1 - e^{-E}$, where
$E$ is the BLAST E-value (note c).
6 For each genome in which a homologue is found, the corresponding column in the
vector is assigned a value of one.
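The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in Pellegrini et al. [4]: the function names `evalue_to_pvalue` and `build_profile`, the `(genome, E-value)` hit format, and the example numbers are all assumptions, and the BLAST search itself (step 3) is taken to have been run and parsed already.

```python
import math

def evalue_to_pvalue(e_value):
    """Convert a BLAST E-value to a P-value using P = 1 - exp(-E)."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)

def build_profile(genomes, hits, n, m):
    """Build a binary phylogenetic profile for one query protein.

    genomes: ordered list of selected genome names (step 1)
    hits:    iterable of (genome, e_value) pairs from the similarity search
             (hypothetical format; parsing BLAST output is not shown)
    n:       number of proteins in the query's own genome
    m:       number of proteins in the other genomes
    """
    threshold = 1.0 / (n * m)             # P-value threshold from step 5
    profile = [0] * len(genomes)          # step 2: one column per genome, all zero
    column = {name: i for i, name in enumerate(genomes)}
    for genome, e_value in hits:
        if genome in column and evalue_to_pvalue(e_value) < threshold:
            profile[column[genome]] = 1   # step 6: homologue found in this genome
    return profile

# Illustrative values only (not from the source).
example_genomes = ["genome_A", "genome_B", "genome_C"]
example_hits = [("genome_A", 1e-30), ("genome_B", 0.5), ("genome_C", 1e-12)]
example_profile = build_profile(example_genomes, example_hits, n=4000, m=10000)
# example_profile is [1, 0, 1]: only the hits in genome_A and genome_C pass 1/(nm) = 2.5e-8
```

Keeping the threshold computation inside `build_profile` is a design choice made for brevity; a fixed E-value cutoff, as in step 4, could be substituted by comparing `e_value` directly.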
Notes
b This is the original approach described in Pellegrini et al. [4]. A simpler method to construct
phylogenetic profiles using the COG database is described in Natale et al. [65].
c More details on the P-value and E-value, as well as other statistical measures used in BLAST,
can be found at http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html.
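As a small worked check on the relationship between the P-value and the E-value (a standard series expansion, not taken from the source), the two measures nearly coincide when E is small:

```latex
P = 1 - e^{-E} = E - \frac{E^{2}}{2} + \frac{E^{3}}{6} - \cdots \approx E
\qquad \text{for } E \ll 1
```

For instance, $E = 10^{-6}$ gives $P \approx 10^{-6} - 5 \times 10^{-13}$, so at stringent thresholds the difference between the two is negligible.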
Comparing genes based on evolutionary trees better reflects the evolutionary similarities
of genes, but is conceptually more complex and computationally more expensive. An
appealing approach that avoids the complexity of such methods, yet can achieve similar
sensitivity, is a heuristic approach proposed in [66]. The heuristic approach
considers the underlying phylogeny of the profiles by first ordering reference organisms
based on evolutionary distances. Genes that truly co-evolve are likely to be conserved in
distant organisms; hence, the matching organisms are less likely to occur in consecutive
runs in the sorted profiles. A hypergeometric approach is used to compute the probability
that random gene pairs achieve as many or fewer matching runs than observed between a pair of
genes. The larger the number of observed runs, the more likely the genes have co-evolved
and, hence, share functions.
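The run-counting idea can be illustrated with a short Python sketch. It rests on several assumptions that are not taken from [66]: a "match" is defined here as a genome in which both genes are present, the profiles are already ordered by evolutionary distance, and the null model places the observed number of matches uniformly at random among the ordered genomes. The function names are hypothetical, and the exact statistic used in [66] may differ.

```python
from math import comb

def count_match_runs(profile_a, profile_b):
    """Count maximal runs of consecutive genomes in which both genes are present.

    Both profiles are binary lists over the same genomes, assumed to be
    sorted by evolutionary distance.
    """
    runs = 0
    in_run = False
    for a, b in zip(profile_a, profile_b):
        match = (a == 1 and b == 1)
        if match and not in_run:
            runs += 1              # a new run of matches starts here
        in_run = match
    return runs

def prob_runs_at_most(n, k, r):
    """Probability that k matches placed at random among n ordered genomes
    form r or fewer runs.

    The number of arrangements with exactly j runs is
    C(k-1, j-1) * C(n-k+1, j), out of C(n, k) arrangements in total.
    """
    if k == 0:
        return 1.0                 # no matches means zero runs
    favourable = sum(comb(k - 1, j - 1) * comb(n - k + 1, j)
                     for j in range(1, r + 1))
    return favourable / comb(n, k)

# Illustrative profiles only (not from the source).
gene_x = [1, 1, 0, 1, 0, 1]
gene_y = [1, 1, 0, 1, 1, 0]
observed_runs = count_match_runs(gene_x, gene_y)   # matches at positions 0, 1, 3
```

Under this null model, a value of `prob_runs_at_most` close to 1 means the observed matches are spread over more runs than most random arrangements would produce, which is the pattern the heuristic associates with co-evolution.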