Prediction of Protein Function - Genomics: Essential Methods

PROTOCOL 9.2 Computing a Phylogenetic Profile for a Gene b

Requirements

• Internet access to download data and software

• Programming or scripting language such as C, C++, Perl, or Matlab

Method

1 Decide which genomes to use for the profile. See Sun et al. [60] for guidelines on choosing appropriate genomes.

2 Construct a vector with each column representing a selected genome and initialize the value of each column to zero.

3 Given the protein sequence of the gene of interest, perform a sequence similarity search using BLAST or PSI-BLAST [58] against protein sequences from the selected genomes. Protein sequences can be obtained from SwissProt (http://www.ebi.ac.uk/swissprot/).

4 Retrieve all matching proteins with an E-value below the threshold as homologues.

5 In Pellegrini et al. [4], the P-value threshold is computed as 1/nm, where n is the number of proteins in the genome from which the sequence is taken and m is the number of proteins from the other genomes. The P-value is defined as 1 − e^(−E), where E is the BLAST E-value. c

6 For each genome in which a homologue is found, the corresponding column in the vector is assigned a value of one.
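
Steps 2 to 6 can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation used in Pellegrini et al. [4]: the function names, the dictionary `best_e_values` (the lowest BLAST E-value found per genome, parsed beforehand from the search output) and the example numbers are all hypothetical, and the threshold 1/nm is read here as 1/(n × m).

```python
import math


def p_value_from_e_value(e_value):
    """Convert a BLAST E-value to a P-value using P = 1 - exp(-E)."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


def phylogenetic_profile(best_e_values, genomes, n, m):
    """Build a binary presence/absence vector over the selected genomes.

    best_e_values: dict mapping genome name -> lowest E-value of any hit
                   in that genome (hypothetical input; genomes without a
                   hit are simply absent from the dict).
    genomes:       list of selected genome names, one per column.
    n:             number of proteins in the query's own genome.
    m:             number of proteins in the other genomes.
    """
    threshold = 1.0 / (n * m)      # assumed reading of "1/nm"
    profile = [0] * len(genomes)   # step 2: initialise every column to zero
    for column, genome in enumerate(genomes):
        e_value = best_e_values.get(genome)
        if e_value is None:
            continue               # no hit at all in this genome
        if p_value_from_e_value(e_value) < threshold:
            profile[column] = 1    # step 6: a homologue was found
    return profile


# Illustrative values only (not taken from any real search).
genomes = ["genome_A", "genome_B", "genome_C"]
hits = {"genome_A": 1e-30, "genome_B": 0.5}
profile = phylogenetic_profile(hits, genomes, n=4000, m=20000)
```

With these made-up numbers the threshold is 1.25e-8, so only the hit in `genome_A` counts as a homologue; the weak hit in `genome_B` and the missing hit in `genome_C` both leave their columns at zero.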

Notes

b This is the original approach described in Pellegrini et al. [4]. A simpler method to construct phylogenetic profiles using the COG database is described in Natale et al. [65].

c More details on the P-value and E-value, as well as other statistical measures used in BLAST, can be found at http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html.

Comparing genes based on evolutionary trees better reflects the evolutionary similarities of genes, but is conceptually more complex and computationally more expensive. An appealing alternative that avoids this complexity, yet can achieve similar sensitivity, is the heuristic approach proposed in [66]. The heuristic accounts for the underlying phylogeny of the profiles by first ordering the reference organisms by evolutionary distance. Genes that truly co-evolve are likely to be conserved in distant organisms; hence, the matching organisms are less likely to occur in consecutive runs in the sorted profiles. A hypergeometric approach is used to compute the probability that random gene pairs achieve equal or fewer matching runs than observed between a pair of genes. The larger the number of observed runs, the more likely the genes are to have co-evolved and, hence, to share functions.
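
The run-counting part of this heuristic can be illustrated with a short Python sketch. It is not the published method of [66]: the helper names are hypothetical, a "match" is assumed to mean that both genes are present in an organism, and the hypergeometric probability itself is deliberately left out because its exact form is not given here. The sketch only reorders two profiles by a supplied organism ordering and counts the runs of consecutive matches.

```python
def order_profile(profile, genomes, ordered_genomes):
    """Rearrange a profile so its columns follow ordered_genomes.

    ordered_genomes is assumed to list the same organisms as genomes,
    sorted by evolutionary distance (the ordering itself is an input).
    """
    position = {genome: i for i, genome in enumerate(genomes)}
    return [profile[position[genome]] for genome in ordered_genomes]


def count_match_runs(profile_a, profile_b):
    """Count maximal runs of consecutive organisms in which both genes occur."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same organisms")
    runs = 0
    in_run = False
    for a, b in zip(profile_a, profile_b):
        match = (a == 1 and b == 1)   # assumed definition of a match
        if match and not in_run:
            runs += 1                 # a new run starts here
        in_run = match
    return runs


# Illustrative profiles, already sorted by evolutionary distance.
gene_x = [1, 1, 0, 1, 0, 1, 1]
gene_y = [1, 1, 0, 1, 1, 0, 1]
observed_runs = count_match_runs(gene_x, gene_y)
```

In this toy example the two genes co-occur in organisms 0, 1, 3 and 6, which form three separate runs. Under the heuristic described above, matches scattered over several runs are stronger evidence than the same number of matches packed into one block of closely related organisms.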
