Chromosome phylogeny (Bioinformatics)

1. Introduction

Impressive sequencing and comparative mapping endeavors have made available detailed whole-genome sequences and maps (Lander et al., 2001; Venter et al., 2001; Waterston et al., 2002). One of the stated goals of these projects is to better our understanding of organismal biology through comparative analyses (O’Brien et al., 1999). The rationale is that insights into the dynamics of evolution can help ascertain the underlying adaptive biochemical and physiological responses. Comparisons can be made at the raw sequence level, where the focus is typically on the analysis of individual genes, or at the coarse genomic level, where the focus is on large-scale rearrangement events (see Article 42, Reconstructing vertebrate phylogenetic trees, Volume 7 and Article 44, Phylogenomic approaches to bacterial phylogeny, Volume 7). Dobzhansky and Sturtevant (1938) pioneered the latter type of analysis almost 70 years ago in a study of inversions in Drosophila pseudoobscura.

Since point mutations and rearrangement events act as parallel modes of evolution, they can be exploited concurrently to provide complementary perspectives on the relationships between the genomes under study. There are two important advantages of studying rearrangements, or chromosomal mutations: first, rearrangement analyses capture the evolution of a chromosome or a genome as a whole and not just locally. Second, because chromosomal mutations are rare events (Rokas and Holland, 2000), and because they are selected from a large set of possible candidates (e.g., there is a quadratic number of potential inversions that can affect a genome), they are less likely to suffer from homoplasy. This allows the reconstruction of scenarios even for deep phylogenies.
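The "quadratic number of potential inversions" can be made concrete with a small sketch. It assumes a single linear chromosome represented as a list of signed integers (sign = orientation), which is a common convention but not one spelled out in the text; the function names are illustrative only.

```python
def apply_inversion(genome, i, j):
    """Reverse the segment genome[i..j] (inclusive) and flip each gene's orientation."""
    segment = [-g for g in reversed(genome[i:j + 1])]
    return genome[:i] + segment + genome[j + 1:]


def all_inversions(genome):
    """Enumerate every genome reachable by one inversion of a contiguous segment."""
    n = len(genome)
    return [apply_inversion(genome, i, j) for i in range(n) for j in range(i, n)]


genome = [1, 2, 3, 4, 5]
neighbors = all_inversions(genome)
n = len(genome)
# One inversion per pair of endpoints i <= j, hence n(n+1)/2 candidates.
print(len(neighbors), n * (n + 1) // 2)
```

Because each observed inversion is one choice among roughly n²/2 candidates, two lineages are unlikely to hit the same one independently, which is the intuition behind the low-homoplasy argument above.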


2. Algorithms and recent applications

There are different types of rearrangement events: some affect gene order (inversions, translocations, fusions, fissions, transpositions, and inverted transpositions), while others affect gene content (insertions, deletions, and duplications). Various methods have been developed to efficiently compute the rearrangement distance and sort a pair of genomes with equal gene content under different sets of operations; for instance, with inversions only (Hannenhalli and Pevzner, 1999; Bader et al., 2001; Bergeron, 2001), with inversions, translocations, fusions, and fissions (Hannenhalli and Pevzner, 1995; Tesler, 2002), and with transpositions (Meidanis et al., 1997; Bafna and Pevzner, 1998). Some of these methods were then further adapted for multiple genome comparisons and phylogeny reconstruction under a maximum parsimony criterion: with inversions only (Siepel and Moret, 2001) and with inversions, translocations, fusions, and fissions (Bourque and Pevzner, 2002). See Figure 1 for an example.
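To illustrate what "inversion distance" means for two genomes with equal gene content, here is a minimal brute-force sketch. It is not the Hannenhalli-Pevzner algorithm or any of the efficient methods cited above: it simply runs a breadth-first search over signed permutations, so it is only practical for a handful of genes. The signed-tuple representation and function names are assumptions made for illustration.

```python
from collections import deque


def invert(perm, i, j):
    """Reverse perm[i..j] (inclusive) and flip signs; perm is a tuple of signed ints."""
    return perm[:i] + tuple(-g for g in reversed(perm[i:j + 1])) + perm[j + 1:]


def inversion_distance(source, target):
    """Minimum number of inversions turning source into target (exhaustive BFS)."""
    source, target = tuple(source), tuple(target)
    if sorted(map(abs, source)) != sorted(map(abs, target)):
        raise ValueError("genomes must have equal gene content")
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        perm, dist = queue.popleft()
        if perm == target:
            return dist
        n = len(perm)
        for i in range(n):
            for j in range(i, n):
                nxt = invert(perm, i, j)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    raise RuntimeError("target not reachable")  # cannot happen with equal gene content


print(inversion_distance([1, -3, -2, 4], [1, 2, 3, 4]))
```

The search space grows as n!·2ⁿ, which is why the polynomial-time methods cited in the text matter in practice.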

Even without specifying a model of rearrangement events, the relative gene order can be used to define the breakpoint distance that can then be exploited, again under a parsimony criterion, for phylogenetic tree reconstruction (Blanchette et al., 1997; Blanchette et al., 1999). Although models that are not restricted to equal gene content are more comprehensive (see El-Mabrouk, 2005 for a review), they are typically more challenging algorithmically and have been restricted to a limited number of phylogenetic applications (Sankoff et al., 2000; Earnest-DeYoung et al., 2004).
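The breakpoint distance can be sketched in a few lines. The version below assumes signed, linear, unichromosomal genomes whose genes are labelled 1..n, and frames each genome with caps 0 and n+1 so that the chromosome ends count as adjacencies; those conventions, and the function names, are illustrative assumptions rather than details given in the text.

```python
def framed(genome):
    """Add artificial caps 0 and n+1 so the chromosome ends are treated as adjacencies."""
    return [0] + list(genome) + [len(genome) + 1]


def adjacencies(genome):
    """All ordered adjacencies of a signed genome, read on either strand."""
    seq = framed(genome)
    adj = set()
    for a, b in zip(seq, seq[1:]):
        adj.add((a, b))
        adj.add((-b, -a))  # the same adjacency read from the opposite strand
    return adj


def breakpoint_distance(g1, g2):
    """Number of adjacencies of g1 that are not adjacencies of g2."""
    adj2 = adjacencies(g2)
    seq = framed(g1)
    return sum((a, b) not in adj2 for a, b in zip(seq, seq[1:]))


print(breakpoint_distance([1, -3, -2, 4], [1, 2, 3, 4]))
```

No rearrangement model appears anywhere in this computation: only the relative order and orientation of the genes are used.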

Prior to the availability of detailed large-scale genome sequence data, rearrangement studies were restricted to the analysis of gene orders in small genomes such as those of mitochondria or chloroplasts (Palmer and Herbon, 1988; Sankoff et al., 1992; Bafna and Pevzner, 1995; Blanchette et al., 1999; Cosner et al., 2000). With the significant progress in large-scale sequencing and comparative mapping, rearrangement studies had to be reconfigured into a two-step process:

1. Identification of homologous syntenic blocks (HSBs) shared by the set of genomes under study.

2. Comparison of the respective arrangements of these common blocks in the different genomes.

The HSBs that need to be identified in step 1 can be obtained either from information on orthologous genes (Zdobnov et al., 2002) or directly from homologous sequences (Kent et al., 2003; Pevzner and Tesler, 2003a). In both cases, a threshold needs to be set to allow the HSBs to extend over small local discrepancies that could originate from different sources: erroneous assignment of orthologous genes (especially in large gene families), assembly errors, smaller rearrangement events falling outside the model (e.g., transposons), and so on. There are advantages to using orthologous genes to identify HSBs: it focuses the analysis on important regions of the genome, the thresholds are length independent, and the blocks are less sensitive to repeat sequences. Similarly, there are advantages to using raw sequence data: it avoids annotation problems, it works in noncoding regions, it is less sensitive to gene families, and finally it preserves more information on microrearrangements (rearrangements within HSBs) that can then be used as additional phylogenetic characters (Bourque et al., 2005).
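A deliberately simplified sketch of step 1 from orthologous genes follows. It assumes the reference genome lists its genes as 1..n, that the second genome is given as a signed permutation of those genes, and it merges only strictly collinear runs. Real HSB construction uses a gap threshold to bridge small discrepancies, as described above; this sketch omits that, and all names are hypothetical.

```python
def collinear_blocks(genome):
    """Split a signed gene order (relative to reference 1..n) into maximal collinear runs.

    Two consecutive genes stay in one block when the second equals the first plus one,
    which covers both strands: (3, 4) on the forward strand and (-4, -3) on the reverse.
    """
    blocks = []
    for gene in genome:
        if blocks and gene == blocks[-1][-1] + 1:
            blocks[-1].append(gene)
        else:
            blocks.append([gene])
    return blocks


def block_permutation(blocks):
    """Relabel blocks 1..k by reference order; the sign gives the block's orientation."""
    order = sorted(range(len(blocks)), key=lambda k: min(abs(g) for g in blocks[k]))
    rank = {k: r + 1 for r, k in enumerate(order)}
    return [rank[k] if blocks[k][0] > 0 else -rank[k] for k in range(len(blocks))]


other_genome = [1, 2, 3, -6, -5, -4, 7, 8]  # made-up gene order for illustration
blocks = collinear_blocks(other_genome)
print(blocks)
print(block_permutation(blocks))
```

The resulting signed permutation of blocks is the kind of input that step 2 passes to a rearrangement algorithm, with blocks playing the role that individual genes played in earlier studies.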

Once the HSBs are identified, they can be processed in step 2 using algorithms for genome rearrangements in the same way genes were used previously. This two-step methodology was used to compare the human with the mouse genome (Pevzner and Tesler, 2003a) and suggested a larger number of rearrangements than previously expected (mostly intrachromosomal rearrangements). It also helped refine a model for chromosome evolution in which some breakpoints are reused nonrandomly (Pevzner and Tesler, 2003b; Larkin et al., 2003).

Figure 1 Mammalian X chromosome phylogeny. The arrangements of 11 homologous syntenic blocks, identified on the X chromosome of seven contemporary mammalian genomes (human, mouse, rat, cat, dog, pig, and cattle), are shown at the bottom of the tree. Blocks are drawn proportionally to their size in human and gaps are only shown in human to display coverage. A diagonal line traverses the blocks to show their order and orientation relative to human. The top of the tree exhibits the putative X chromosome ancestors in a most parsimonious inversion scenario as recovered by MGR (Bourque and Pevzner, 2002). The occurrence of an inversion is shown with a small cross on a branch of the tree but the exact timing of events on that branch is unknown. Hypothetical intermediate X chromosomes on a path are displayed using a white background. Data adapted from Murphy et al. (in press).

When more than two genomes are compared, rearrangement scenarios not only lead to the inference of phylogenetic relationships but also provide rate estimates for the different branches of the tree and allow for the reconstruction of the putative architecture of ancestral genomes. A comparison of the human, mouse, and rat genomes (Bourque et al., 2004) confirmed the accelerated rate of interchromosomal rearrangements previously observed in lower-resolution studies of rodent genomes and predicted the genomic architecture of the murid rodent ancestor. The addition of the chicken genome as an outgroup (Bourque et al., 2005) allowed the reconstruction of the putative mammalian ancestral genome and further localized the lineage-specific chromosomal mutations. The analysis further revealed highly variable rates of genomic rearrangements across different branches of the tree, with a particularly slow rate of interchromosomal rearrangements in the chicken lineage, in the early mammalian lineage, or in both.

Applications focusing on specific areas of the genomes allow for the identification of very detailed scenarios and possibly provide clues into the rearrangement mechanisms themselves. In the comparison of large vertebrate genomes, the X chromosome is a very interesting subproblem for studying chromosome evolution as it rarely exchanges genetic material with other chromosomes. Figure 1 shows the evolution of the X chromosome in seven mammalian genomes (human, mouse, rat, cat, dog, pig, and cattle), an example adapted from Murphy et al. (in press). This example highlights once again the highly variable rate of genome rearrangements: about 85% (11/13) of the large-scale inversions affecting the X chromosome are found on the rodent branches, whose total length represents only about 20% of the total length of the branches of the tree (corresponding to 500 million years of evolution). What triggered this accelerated rate of inversions on the rodent X chromosome, and what are its consequences? These questions will require further work.
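The figures quoted above can be turned into a rough relative-rate estimate. The sketch below uses only the approximate numbers given in the text (11 of 13 inversions, about 20% of total branch length on rodent branches) and assumes that inversions per unit branch length is a reasonable rate measure; the resulting fold difference is a back-of-the-envelope value, not a result reported by the cited study.

```python
def relative_rate(events_in, events_total, length_fraction_in):
    """Ratio of events per unit branch length inside a clade versus the rest of the tree."""
    rate_in = events_in / length_fraction_in
    rate_out = (events_total - events_in) / (1.0 - length_fraction_in)
    return rate_in / rate_out


share = 11 / 13                      # fraction of inversions on rodent branches
fold = relative_rate(11, 13, 0.20)   # rodent rate relative to all other branches
print(f"share of inversions: {share:.2f}")
print(f"approximate fold difference in rate: {fold:.0f}")
```

Under these assumptions the rodent branches accumulate inversions roughly twenty times faster per unit branch length than the rest of the tree, which is a more striking way of stating the 85% versus 20% contrast.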

3. Limitations and future prospects

Many of the algorithms and applications described above still rely on a very simplistic model of evolution in which a limited set of equally likely operations is considered. Given the wealth of genomic data now available, it would be interesting, and challenging, to revisit these models, parameterize the different operations in a way similar to what was done for mitochondrial genomes (Blanchette et al., 1996), and estimate their preponderance from the data itself. Such an experiment could potentially lead to a more realistic biological model of rearrangements but, more importantly, it might provide new insights into the dynamics of genome evolution. In terms of making the rearrangement models more realistic, integrating other sources of information such as centromere and telomere positions could also prove valuable.

A related topic involves the development of a Bayesian framework for genome rearrangements. The methods described above have relied on a maximum parsimony criterion but, in some cases, finding a most parsimonious scenario is insufficient. For difficult problems (i.e., highly nonadditive trees), there are typically many optimal solutions and even the assumption that the actual history of rearrangement corresponds to a most parsimonious scenario becomes weak. A Bayesian approach has already been suggested for unichromosomal genomes (Larget et al., 2005) but has yet to be adapted to multichromosomal genomes. Such an approach would not only provide a richer description of the solution space but also allow a more natural parameterization of the rearrangement operations.

An emerging approach for chromosome phylogeny, similar in spirit to the analysis of breakpoints in that it does not require the specification of a rearrangement model, involves the analysis of conserved intervals in unichromosomal genomes (Bergeron et al., 2004). Although this approach does not analyze rearrangements directly, it can also identify phylogenetic relationships and reconstruct ancestral architectures by maximizing the conservation of the relative order of subsets of genes. It has the potential to be applicable to a wide variety of questions.
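As an illustration, the sketch below lists the conserved intervals shared by two signed, unichromosomal gene orders. It assumes one common formulation of the concept: an interval bounded by genes a and b is conserved when a precedes b (or -b precedes -a) in both genomes and the unsigned set of genes between them is the same. That definition is stated here as an assumption, since the text does not spell it out, and the quadratic enumeration is for clarity rather than efficiency.

```python
def conserved_intervals(p, q):
    """Return the pairs (a, b) bounding intervals of p that are conserved in q.

    p and q are signed permutations (lists of nonzero ints) on the same genes.
    """
    pos = {gene: k for k, gene in enumerate(q)}  # signed gene -> index in q
    found = []
    for i in range(len(p)):
        for j in range(i + 1, len(p)):
            a, b = p[i], p[j]
            if a in pos and b in pos and pos[a] < pos[b]:
                lo, hi = pos[a], pos[b]        # same orientation in q
            elif -a in pos and -b in pos and pos[-b] < pos[-a]:
                lo, hi = pos[-b], pos[-a]      # whole interval inverted in q
            else:
                continue
            inner_p = {abs(g) for g in p[i + 1:j]}
            inner_q = {abs(g) for g in q[lo + 1:hi]}
            if inner_p == inner_q:
                found.append((a, b))
    return found


print(conserved_intervals([1, 2, 3, 4], [1, -3, -2, 4]))
```

In this toy case the interval bounded by genes 1 and 4 survives the inversion of genes 2 and 3, as does the inverted interval bounded by 2 and 3; counting such shared intervals gives a similarity measure without naming any rearrangement operation.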

Another desirable improvement would involve the efficient expansion of genome rearrangement studies to genomes with unequal gene, or sequence, content. Although some preliminary studies have been moderately successful (Sankoff et al., 2000; Earnest-DeYoung et al., 2004), a tool to compare systematically whole eukaryotic genomes, not only on a set of common HSBs, is still needed. An intermediate solution could be to adapt multigenome studies to only require pairwise identification of common blocks.

As alternative methodologies for the analysis of genome rearrangements emerge, new systematic ways of comparing not only recovered phylogenies but also predictions at ancestral nodes will be required. Putative ancestors are difficult to compare because, in different studies, they are reconstructed using different sets of HSBs. So far, quality assessments have been limited to the analysis of predicted chromosomal associations (Bourque et al., 2004; Bourque et al., 2005; Murphy et al., in press), but it would be interesting to analyze more thoroughly the robustness of the rearrangement scenarios recovered when different choices of parameters, or completely different approaches, are used to generate HSBs.

4. Conclusion

Finally, as more and more detailed sequences and maps are released, lineage-specific breakpoint regions will be more systematically defined simply by the identification of HSBs. A recent study by Murphy et al. (in press) confirms that around 20% of breakpoint regions are reused and that those breakpoint regions are enriched for centromeres. The same study also finds interesting correlations between frequently rearranged regions, gene density, segmental duplications, and recurring cancer breakpoints. These evolutionary breakpoints represent an extremely rich source of information on the mechanisms regulating genome rearrangements and on what can lead to their fixation in a population. As in-depth analyses of breakpoint regions are pushed further, the hope is that crucial information on these mechanisms will be uncovered. The challenge will then be to refine models and algorithms for genome rearrangements so that they make effective use of that information.
