Connecting genes by comparative genomics (Bioinformatics)

A result of deep importance to modern molecular genetics is the observation that most biological functions are carried out by the products of multiple genes and that most genes are associated with multiple functions (Hartwell et al., 1999). Unraveling the functional relationships among genes thus provides a promising route to the molecular understanding of biological functions.

Examples of tight functional relationships among genes are ubiquitous. Metabolic pathways are composed of multiple gene products each corresponding to a step along the pathway. Genes of the same metabolic pathway or signaling cascade can be said to be functionally linked. Importantly, there is frequent cross-talk among the pathways, such that a gene may subserve multiple pathways: human gene GPI, glucose phosphate isomerase, is found in glycolysis, pentose phosphate, and starch and sucrose pathways. To complicate matters further, complexes of several gene products, each of which can also have membership in multiple other types of such complexes (Gavin et al., 2002), may actually accomplish a single metabolic step. Besides pathways, cohorts of genes regularly form functional systems such as the oxidative phosphorylation system and the flagellum. These systems are intricately structured complexes and here also we find that members of one system are actually modular and can be used in the context of multiple functional systems as in the type III secretion system and the flagellum (Gophna et al., 2003).


Ideally, all such interactions would be identified by experiments testing every possible pair of genes. Indeed, high-throughput methods such as yeast two-hybrid (Ito et al., 2001; Uetz et al., 2000) and tandem-affinity purification (TAP) coupled with mass spectrometry (Gavin et al., 2002) are extremely useful in identifying functional links on a large scale. However, high-throughput as these methods may be, they cannot be truly complete. Since there are on the order of N^2 possible pairwise interactions, where N is the number of genes in a genome, it is not feasible to test them all. Moreover, these methods exhibit high false-positive rates, are labor-intensive, and to date have mostly been used to identify physical protein-protein interactions, a subset of the more general functional relationships described above.
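To give a feel for the N^2 scaling, the short Python sketch below counts the unordered gene pairs for a few genome sizes. The gene counts are assumed round numbers chosen for illustration, not figures taken from the studies cited here.

```python
from math import comb


def candidate_pairs(n_genes: int) -> int:
    """Number of unordered gene pairs, N * (N - 1) / 2, which grows as N^2."""
    return comb(n_genes, 2)


# Assumed, rounded gene counts used only to illustrate the growth.
EXAMPLE_GENOMES = [
    ("small bacterial genome", 500),
    ("yeast-sized genome", 6000),
    ("human-sized genome", 20000),
]

for label, n_genes in EXAMPLE_GENOMES:
    print(f"{label}: {n_genes} genes -> {candidate_pairs(n_genes):,} pairs")
```

Even at a few thousand genes the number of pairs runs into the tens of millions, which is why exhaustive pairwise testing is impractical.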

An alternative, complementary approach to brute-force experiments is not unlike the one applied to comparative anatomy in the nineteenth century. In 1843, Sir Richard Owen, the Superintendent of the British Museum, compared the skeletons of the various classes of vertebrates: fish, reptiles, birds, and mammals. From his comparisons, Owen boldly concluded that vertebrates have a common structural plan he called an “archetype”. Owen introduced the term homology to mean “the same organ in different animals under every variety of form and function” (1843). Thus, human fingers are homologous to bird fingers. Owen originally interpreted such similarities as evidence of a creator’s “archetypes”, but with Darwin’s introduction of evolutionary theory, homology took on a new significance. In light of evolution, homology is interpreted as similarity due to common ancestry. By comparing increasingly distant organisms, it is thus possible to trace the order in which anatomical features appeared. An understanding of the history of a biological phenomenon can offer tremendous insight into its functioning. Essentially, the relationships among a skeleton’s bones are revealed by exploiting the common ancestry of extant organisms through a comparison of the disposition of their homologies.

Relationships among a genome’s genes may similarly be revealed by analyses of their disposition in other genomes. As expected, the operational unit of comparison in such an approach is a specific kind of homology called orthology, most easily defined as “the same gene in different genomes”. More exactly, an inference of orthology between two genes signifies a relationship through speciation: we infer that the two genes stem from one precursor gene that was present in the most recent common ancestor, before the speciation event that separated the two lineages.

The ability to identify orthologous relationships proved to be of revolutionary importance to genetics. The reason is that an orthologous relationship provides a strong hint toward functional conservation. Although the length of time since divergence is a correlating factor, the notion that orthologs generally maintain the same function has found wide-ranging experimental support. Thus, orthology is the great simplifying notion of genetics; it offers the hope that each gene in every organism will not need to be studied separately to be understood. The knowledge acquired by hard work in one organism can thus be transferred to the ortholog in another organism. For example, frog histones need not be thoroughly studied because histone structure and function have already been worked out in detail in yeast and the general mechanisms can be transferred. An important caveat, however, is that orthologous genes can sometimes evolve to take on very different functions. Nonetheless, the delineation of orthologous gene families is of such fundamental importance that it should be considered the first law of functional genomics.

Working with orthologous families of genes allows the goal of identifying gene relationships to be reformulated. Instead of seeking to link individual genes, it is more efficient to link orthologous families. Three methods have been proposed to detect relationships among families based upon complete genomes.

The phylogenetic distribution of the members of an ortholog family is informative. A given gene family may be, for example, specific to bacteria, scattered across the microbes, or present in all organisms examined. Consider that with 100 genomic sequences, there are 2^100 (on the order of 10^30) possible distributions. Specific enough to be informative, a family’s phylogenetic distribution may be taken as its fingerprint. A family’s phylogenetic distribution, or phylogenetic profile, is however often not unique to it alone. Where another family’s profile is identical or very similar, the match may be a chance coincidence of no biological importance, or it may hint at a relationship between the two families.

Correlated presences and absences in the same genomes of members of a pair of ortholog families may be explained by a functional link. In other words, if the pair of genes is required for a certain function, a genome would not be expected to have only one of them. Thus, tightly coupled genes would indeed be expected to have identical profiles of occurrences (Date and Marcotte, 2003; Huynen and Bork, 1998; Pellegrini et al., 1999; Tatusov et al., 1997; Wu et al., 2003). The bacterial flagellum is a good example of a functional system with conserved phylogenetic profiles (Pellegrini et al., 1999). The principle of phylogenetic profiling has been used to identify novel pathways (Date and Marcotte, 2003).
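The idea of matching phylogenetic profiles can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of any of the cited studies: the family names, genome names, and presence/absence data are invented, and the similarity criterion (a Hamming-distance cutoff, `max_mismatch`) is one simple choice among many.

```python
from itertools import combinations

# Hypothetical genomes and presence/absence data (invented for illustration).
GENOMES = ["g1", "g2", "g3", "g4", "g5", "g6"]
FAMILIES = {
    "famA": {"g1", "g2", "g5"},
    "famB": {"g1", "g2", "g5"},
    "famC": {"g1", "g3", "g4", "g6"},
    "famD": set(GENOMES),  # present everywhere, so its profile is uninformative
}


def profile(present_in, genomes):
    """Presence/absence vector of one family across the ordered genomes."""
    return tuple(int(g in present_in) for g in genomes)


def hamming(p, q):
    """Number of genomes in which two profiles disagree."""
    return sum(a != b for a, b in zip(p, q))


def profile_links(families, genomes, max_mismatch=0):
    """Pairs of families whose profiles differ in at most max_mismatch genomes."""
    profiles = {fam: profile(gs, genomes) for fam, gs in families.items()}
    links = []
    for a, b in combinations(sorted(profiles), 2):
        pa, pb = profiles[a], profiles[b]
        # Skip profiles that are all-present or all-absent: they match trivially.
        if len(set(pa)) == 1 or len(set(pb)) == 1:
            continue
        if hamming(pa, pb) <= max_mismatch:
            links.append((a, b))
    return links


print(profile_links(FAMILIES, GENOMES))
```

With this toy data only famA and famB share a profile, so they form the single predicted link. Raising `max_mismatch` tolerates a few disagreements at the cost of more chance matches.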

The physical proximity of a pair of genes along the chromosome may also be of functional importance. Chromosomal proximity of genes may be maintained by selection due to a shared mechanism of expression such as operons, or perhaps, as is common in Bacteria and Archaea, because their proximity offers a better survival rate for horizontal transfer of the paired genes (Lawrence, 1999). In both instances, inferring a functional link is appropriate. On the other hand, proximity between two genes may be of no biological importance. A comparative genomic approach can distinguish these two cases. It has been shown that the number of genomes in which proximity is found is correlated with the accuracy of the functional relationship prediction (Yanai et al., 2002). Thus, for a given pair of ortholog families, we might examine the relative proximity of the pair across genomes (Overbeek et al., 1999). If the genes are proximate in a sufficient number of genomes, we consider the two gene families linked. Interestingly, the families may be linked even if the proximity does not hold in all of the genomes and, consequently, a pair of genes in one genome may be said to be linked by chromosomal proximity even if they are not actually proximate in that genome (Yanai et al., 2002).
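A minimal Python sketch of this counting scheme follows. It assumes each genome is given as a list of ortholog family identifiers in chromosomal order; the gene orders, the adjacency window `max_gap`, and the threshold `min_genomes` are hypothetical choices made for illustration, and the sketch ignores strand and chromosome circularity.

```python
from collections import Counter

# Hypothetical gene orders: genome -> ortholog families in chromosomal order.
GENE_ORDER = {
    "g1": ["A", "B", "C", "D", "E"],
    "g2": ["C", "A", "B", "E", "D"],
    "g3": ["B", "A", "D", "C", "E"],
    "g4": ["A", "D", "E", "B", "C"],
}


def proximate_pairs(order, max_gap=1):
    """Unordered family pairs whose positions differ by at most max_gap."""
    pairs = set()
    for i, fam in enumerate(order):
        last = min(i + max_gap, len(order) - 1)
        for j in range(i + 1, last + 1):
            if order[j] != fam:  # ignore a family neighbouring its own paralog
                pairs.add(frozenset((fam, order[j])))
    return pairs


def proximity_links(gene_order, min_genomes=3, max_gap=1):
    """Family pairs that are proximate in at least min_genomes genomes."""
    counts = Counter()
    for order in gene_order.values():
        counts.update(proximate_pairs(order, max_gap))
    return {tuple(sorted(p)): n for p, n in counts.items() if n >= min_genomes}


print(proximity_links(GENE_ORDER))
```

In this toy data, families A and B are adjacent in three of the four genomes and are therefore linked, even though they are not adjacent in g4. This mirrors the point that a link between families can apply to a genome in which the genes themselves are not proximate.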

Unfortunately, functional links by chromosomal proximity seem to hold only for the Bacteria and Archaea examined thus far. This may be because operon organization is largely limited to these groups. Eukaryotic genomes tend to be less streamlined, and thus their chromosomal organization may be dominated by different evolutionary forces. Despite this limitation, the method's general applicability to Bacteria and Archaea is very useful.

A third type of functional link takes advantage of an extreme case of chromosomal proximity. It has been observed that two genes that are distinct in one organism may be found as one contiguous fusion gene in another organism (Enright et al., 1999; Marcotte et al., 1999). A fusion gene generally unites genes of the same metabolic pathway or protein complex. Thus, even if nothing is known about the two genes, a functional link between them can be inferred on the basis of a fusion event in one (or more) other genomes. Indeed, it has been statistically shown that fusion links tend to pair up genes of the same broad functional category (Yanai et al., 2001).

The major difficulty with the fusion method is that it often appears to be a special case of a general mechanism of modular genes. Most human genes, as well as those of many other genomes, are composed of multiple domains, independent structures serving as evolutionarily conserved building blocks (Ponting and Russell, 2002). Since domains such as SH2 appear in many different kinds of genes, there is no reason to suppose a direct functional link between the domain partners. Thus, the specificity of the fusion method can be increased if fusion links are restricted to those domains that do not appear to be promiscuous (Marcotte et al., 1999).
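The promiscuity filter can be illustrated with a small Python sketch. It assumes that fused (composite) proteins have already been decomposed into their component families. The data, the helper name `fusion_links`, and the cutoff `max_partners` are hypothetical; the actual criterion used by Marcotte et al. (1999) is not reproduced here.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical composite proteins, each given as the families of its parts.
COMPOSITES = [
    ("P", "Q"),
    ("P", "X"),
    ("R", "X"),
    ("S", "X"),
    ("T", "X"),
]


def fusion_links(composites, max_partners=2):
    """Return (links, promiscuous): fusion links that avoid promiscuous families.

    A family is treated as promiscuous when it is fused to more than
    max_partners distinct partner families (an assumed, illustrative cutoff).
    """
    partners = defaultdict(set)
    for parts in composites:
        for a, b in combinations(sorted(set(parts)), 2):
            partners[a].add(b)
            partners[b].add(a)

    promiscuous = {fam for fam, ps in partners.items() if len(ps) > max_partners}

    links = set()
    for fam, ps in partners.items():
        for partner in ps:
            if fam not in promiscuous and partner not in promiscuous:
                links.add(tuple(sorted((fam, partner))))
    return sorted(links), sorted(promiscuous)


links, promiscuous = fusion_links(COMPOSITES)
print("links:", links)
print("promiscuous:", promiscuous)
```

Here family X is fused to four different partners and is set aside as promiscuous, so only the P and Q fusion survives as a predicted functional link.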

A major confounding phenomenon of these three methods involves the issue of gene duplication. Very frequently, a gene has multiple orthologs in another genome, signifying that the original gene underwent duplication since speciation (Jordan et al., 2001). The question of how gene function evolves following a duplication event is, in fact, one of the most interesting open questions in the genomics field today. Since functional conservation within an orthologous family is the first assumption in all the methods, the applicability of the three genomic context methods may be called into question when gene duplications have occurred across the lineages.

How applicable are phylogenetic profiling, chromosomal proximity, and gene fusion to a genome’s genes? On average, the functional links found in a microbial genome involve 57% of the organism’s complete genetic complement (Yanai and DeLisi, 2002). Interestingly, there is very little overlap among the links generated by the three methods but considerable overlap in the genes involved in the links (Yanai and DeLisi, 2002). Thus, the links are largely additive and combine to form networks of interactions. The links uncover substantial portions of known pathways and suggest the function of previously unannotated genes.
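The distinction between overlap in links and overlap in genes can be made concrete with a short Python sketch. The link lists and the genome size below are invented for illustration and do not reproduce the data of Yanai and DeLisi (2002); Jaccard similarity is used here simply as one convenient measure of overlap.

```python
def link_set(pairs):
    """Store links as unordered pairs so that (A, B) equals (B, A)."""
    return {frozenset(pair) for pair in pairs}


def genes_in(links):
    """All genes that take part in at least one link."""
    return set().union(*links) if links else set()


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical links predicted by each method for one genome.
profile_links = link_set([("A", "B"), ("C", "D")])
proximity_links = link_set([("B", "C"), ("A", "B")])
fusion_links = link_set([("D", "A")])

# Because the methods rarely predict the same link, their union is the network.
network = profile_links | proximity_links | fusion_links

GENOME_SIZE = 7  # assumed total number of genes in this toy genome
coverage = len(genes_in(network)) / GENOME_SIZE

print("link overlap:", jaccard(profile_links, proximity_links))
print("gene overlap:", jaccard(genes_in(profile_links), genes_in(proximity_links)))
print("links in network:", len(network))
print("gene coverage:", round(coverage, 2))
```

In this toy case the two methods share only one of three links but three of four genes, which is the pattern that lets separately predicted links join into a connected network.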

The three comparative genomics methods described here apply evolutionary theory to unravel the functional organization of genomes. Comparative genomics can essentially be seen as a decent substitute for a time machine. While ancestral states of an organism would only be available to us with the use of a time machine, comparing extant genomes leads to inferences of these very states. As in other realms, knowledge of the past makes way for understanding the present.
