Pseudogenes (Genomics)

Pseudogenes are genomic regions that derive from genes but have no function (Balakirev and Ayala, 2003; Mighell et al., 2000; Vanin, 1985; Zhang and Gerstein, 2004). Because they were traditionally assumed to have no biological relevance, the identification and classification of pseudogenes attracted little attention and was initially restricted to evolutionary studies. Over the years, a significant number of pseudogenes were uncovered accidentally during the analysis of gene-containing regions in different genomes, but their sequences and locations were rarely reported to public databases. Only recently have scientists become aware of the importance of identifying and annotating pseudogenes. Part of this growing attention derives from the need to provide basic annotation of recently sequenced genomes, which should include pseudogenes. More importantly, preliminary analyses of completed large genomes showed that a significant fraction of likely nonfunctional regions (pseudogenes) were being wrongly annotated as functional genes (Gibbs et al., 2004; Lander et al., 2001; Waterston et al., 2002), which made the identification and classification of pseudogenes extremely relevant. Any automatic gene prediction approach is now required to take the presence of pseudogenes into account and to include rules to distinguish them from genes. Finally, a growing number of reports provide evidence that some regions previously classified as pseudogenes are actually functional, normally being involved in the regulation of gene expression (Korneev et al., 1999; Healy et al., 1991; Hirotsune et al., 2003); collections of dead genes are therefore an excellent source for identifying additional examples of this type.


Although pseudogenes can exceptionally arise from the direct inactivation of functional genes, the vast majority of them are formed through gene duplication. In mammals, two major mechanisms of gene duplication coexist, with different impacts on pseudogene formation and gene evolution: (1) The segmental duplication of genes results in the formation of identical copies, mostly through unequal crossing-over during meiotic recombination (Alberts et al., 1994). This type of duplication might also copy the promoter and other regulatory regions along with the coding region. Therefore, favored by an initial period of functional redundancy and relaxed selective constraint, expressed duplicates might undergo sequence modifications that lead to the acquisition of new or more specialized functions. Although the ultimate fate of most segmentally duplicated genes is unclear in terms of their preservation and the acquisition of new functions, it is commonly accepted that one of the copies usually becomes a nonprocessed pseudogene through the accumulation of lethal mutations, whereas the other retains the original (or eventually enhanced) function (Force et al., 1999; Ohno, 1970; Lynch and Conery, 2000). Alternatively, segmental duplications might involve only fragments of genes, which will likely result in the formation of incomplete and hence nonfunctional gene copies. (2) A second mechanism of gene duplication is retrotransposition, which generates processed gene copies. This mechanism copies mature mRNAs through their reverse transcription and integration back into the genome, using the machinery of retrotransposable elements (Esnault et al., 2000). In contrast to some segmental gene duplicates, processed mRNA copies carry no signals for transcription initiation.
Because mRNA copies insert randomly in the genome, that is, likely far from active promoters, retrotransposed gene copies are rarely expressed and can therefore be considered “dead on arrival”. However, there are isolated examples of functional genes that have arisen through the retrotransposition of mRNAs (Lander et al., 2001; Brosius, 1999; Burki and Kaessmann, 2004). The absence of introns and the presence of flanking repeats and a poly(A) tail are clear characteristics that can be used (when detectable) to distinguish retrotransposed gene copies from segmentally duplicated copies.
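
The structural hallmarks just listed can be combined into a simple rule of thumb. The following Python sketch is only an illustration under stated assumptions: the `CopyEvidence` record, the 30-base window, the 80% adenine threshold, and the decision order are all hypothetical choices made for this example, not a published classification procedure.

```python
from dataclasses import dataclass


@dataclass
class CopyEvidence:
    """Hypothetical summary of what was observed for one gene copy."""
    parent_intron_count: int           # introns in the functional parent gene
    introns_retained: int              # parent introns still present in the copy
    downstream_seq: str                # genomic sequence just 3' of the copy
    has_flanking_direct_repeats: bool  # short direct repeats on both sides


def has_poly_a_tail(downstream_seq: str, window: int = 30,
                    min_fraction: float = 0.8) -> bool:
    """True if the first `window` downstream bases are mostly adenine.

    Window size and threshold are illustrative assumptions.
    """
    head = downstream_seq[:window].upper()
    if len(head) < window:
        return False  # too little sequence to judge
    return head.count("A") / window >= min_fraction


def classify_copy(ev: CopyEvidence) -> str:
    """Return 'processed', 'nonprocessed', or 'ambiguous' (heuristic only)."""
    # Retained introns point to a segmental (DNA-level) duplication.
    if ev.introns_retained > 0:
        return "nonprocessed"
    # A multi-exon parent whose copy lost every intron points to an mRNA origin.
    if ev.parent_intron_count > 0:
        return "processed"
    # Single-exon parent: intron loss is uninformative, so fall back on
    # the poly(A) tail and flanking repeats, when they are detectable.
    if has_poly_a_tail(ev.downstream_seq) or ev.has_flanking_direct_repeats:
        return "processed"
    return "ambiguous"


example = CopyEvidence(parent_intron_count=5, introns_retained=0,
                       downstream_seq="A" * 28 + "GTCCGT",
                       has_flanking_direct_repeats=False)
print(classify_copy(example))  # processed
```

Old retrotransposed copies often lose their poly(A) tail and flanking repeats through mutation, which is why the sketch returns "ambiguous" rather than forcing a call when no hallmark is detectable.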

But how are pseudogenes recognized? Following the general definition, a given genomic region can be cataloged as pseudogenic when it shows homology to an active gene and has no function. Whereas homology is relatively easy to identify through the detection of significant sequence similarity, the absence of function remains impossible to prove, given all the possible (new and subtle) types of biological roles in which a particular region can be involved. Nevertheless, a few characteristics found in a number of gene duplicates have been accepted as strong indicators of the absence of functionality. These features have been used to report not only single cases but also large collections of pseudogenes from different organisms.

The first reports describing pseudogenes relied on the absence of transcription to propose the absence of functionality. In fact, the term pseudogene was first introduced in 1977 to describe a tandem copy of the Xenopus laevis 5S RNA gene for which evidence of expression could not initially be found (Jacq et al., 1977). The same region was later shown to be expressed, but with “inefficient termination” of transcription (Miller and Melton, 1981). Soon after, structurally identical copies of the rabbit and human globin genes were identified next to their functional paralogs and classified as pseudogenes, again using questionable arguments to propose nonfunctionality (Fritsch et al., 1980; Hardison et al., 1979; Lauer et al., 1980). Because the failure to detect expression of segmentally duplicated genes can derive from methodological limitations, this argument is nowadays considered insufficient to suggest nonfunctionality. The classification of retrotransposed gene copies as nonfunctional or pseudogenic also requires additional evidence, even though they are unlikely ever to be transcribed.

The detection of truncations, that is, in-frame stop codons or frameshifts, within duplicated protein-coding regions is accepted as the most convincing evidence to disprove functionality, as their presence is generally not compatible with the synthesis of complete and operative proteins. Through systematic searches for truncated and nearly complete retrotransposed gene copies, two independent approaches identified 3500 (Ohshima et al., 2003) and 8000 (Zhang et al., 2003) processed pseudogenes in the human genome. Similar procedures detected processed pseudogenes in other genomes, including puffer fish (Dasilva et al., 2002), fruit fly (Harrison et al., 2003), worm (Harrison et al., 2001), and yeast (Harrison et al., 2002). However, despite their high specificity, these approaches overlooked an important fraction of pseudogenes: those with an apparently intact coding region but with other types of lethal mutations (e.g., replacements of functionally essential amino acids, or disrupted or missing promoters), pseudogenes arising from segmental duplications, and all pseudogenes with incomplete coding regions.
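
As an illustration of the truncation criterion, the Python sketch below scans a pairwise alignment between the coding sequence of a functional parent gene and a candidate copy. It is a minimal sketch under stated assumptions, not the pipeline used in the cited studies: the function name `find_disablements`, the gapped-string input format (with `-` as the gap character), and the choice to read the copy in the parent's reading frame are all hypothetical.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_disablements(parent_aln: str, copy_aln: str) -> dict:
    """Report frameshifts and premature stops in an aligned gene copy.

    Both inputs are rows of one pairwise alignment (equal length, '-' = gap).
    The parent row is assumed to be a complete CDS starting in frame.
    """
    if len(parent_aln) != len(copy_aln):
        raise ValueError("aligned sequences must have equal length")
    parent_aln, copy_aln = parent_aln.upper(), copy_aln.upper()

    # A gap run whose length is not a multiple of 3 shifts the frame.
    # Gaps in the copy are deletions; gaps in the parent are insertions.
    frameshifts = []
    for kind, seq in (("deletion", copy_aln), ("insertion", parent_aln)):
        for gap in re.finditer(r"-+", seq):
            length = gap.end() - gap.start()
            if length % 3 != 0:
                frameshifts.append((kind, gap.start(), length))

    # Read the copy codon by codon in the parent's frame (alignment columns
    # where the parent has a base) and skip the parent's own terminal codon.
    parent_cols = [i for i, base in enumerate(parent_aln) if base != "-"]
    stops = []
    for k in range(len(parent_cols) // 3 - 1):
        codon = "".join(copy_aln[i] for i in parent_cols[3 * k:3 * k + 3])
        if codon in STOP_CODONS:
            stops.append(k)  # 0-based codon index in the parent CDS

    return {"frameshifts": frameshifts, "premature_stops": stops,
            "disabled": bool(frameshifts or stops)}


parent = "ATGAAACGATTTTAA"
print(find_disablements(parent, "ATGAAATGATTTTAA")["premature_stops"])  # [2]
```

Note that in this sketch a partial copy aligned with leading or trailing gaps would be reported as a frameshift whenever the missing stretch is not a multiple of three bases, so incomplete copies would need separate handling.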

At the same time, another study (Torrents et al., 2003) used a different principle to evaluate the presence or absence of functionality in detectable human gene copies. This principle depended neither on the presence of truncations nor on the mechanism of duplication, so it covered both retrotransposed and segmentally duplicated pseudogenes. The 20 000 complete and partial gene copies identified between predicted and known functional genes were evaluated for functionality taking into account only the ratio of amino acid replacement (nonsynonymous, KA) to silent (synonymous, KS) substitutions, which indicates the associated level of selective constraint (Li et al., 1981). KA/KS ratios of pseudogenes and those of the vast majority of genes are generally different, as mutations in genes causing amino acid replacements with functional consequences are selected against, in contrast to mutations occurring in pseudogenes. As a result, most functional genes show KA/KS ratios well below 1, whereas neutrally evolving sequences approach a ratio of 1. This functionality test revealed that nearly all identified gene copies were neutrally evolving and therefore nonfunctional. The regions identified with this approach, despite constituting the largest set of nonfunctional gene copies identified so far, were estimated to correspond to only a fraction of the complete population of human pseudogenes, which is likely to exceed the number of genes. Following the same detection and classification strategy, a similar number of pseudogenes were identified in mouse and rat (Gibbs et al., 2004). From the comparative analysis of human and mouse orthologous DNA blocks, the same study showed that the majority (>70%) of the pseudogenes are located far from their functional paralogs, which is consistent with a retrotranspositional origin.
The distinction between processed and nonprocessed pseudogenes also revealed that, although the numbers of both types correlate with chromosome size in human, their intrachromosomal distributions differ: processed pseudogenes are more abundant close to telomeres, whereas nonprocessed pseudogenes are normally enriched in gene-dense regions.
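
To make the KA/KS functionality test more concrete, the following Python sketch estimates KA and KS for two aligned coding sequences. It is an illustration only: it uses the unweighted Nei-Gojobori counting scheme with a Jukes-Cantor correction, which is a commonly used stand-in and not necessarily the estimator applied in the studies cited above. The sketch assumes the standard genetic code and two sequences already aligned in frame; codons containing gaps, ambiguous bases, or stops are skipped, and mutational paths through stop codons are not excluded, which is a simplification. All function names are hypothetical.

```python
from itertools import permutations
from math import log

BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def syn_sites(codon):
    """Number of synonymous sites in one codon (between 0 and 3)."""
    count = 0.0
    for pos in range(3):
        for base in BASES:
            mutant = codon[:pos] + base + codon[pos + 1:]
            if base != codon[pos] and CODON_TABLE[mutant] == CODON_TABLE[codon]:
                count += 1 / 3
    return count


def count_diffs(c1, c2):
    """(synonymous, nonsynonymous) differences, averaged over all paths."""
    diff_pos = [i for i in range(3) if c1[i] != c2[i]]
    if not diff_pos:
        return 0.0, 0.0
    paths = list(permutations(diff_pos))
    syn = non = 0.0
    for order in paths:
        cur = c1
        for pos in order:  # change one differing position at a time
            nxt = cur[:pos] + c2[pos] + cur[pos + 1:]
            if CODON_TABLE[cur] == CODON_TABLE[nxt]:
                syn += 1
            else:
                non += 1
            cur = nxt
    return syn / len(paths), non / len(paths)


def jukes_cantor(p):
    """Correct a proportion of differences for multiple substitutions."""
    return -0.75 * log(1 - 4 * p / 3) if p < 0.75 else float("inf")


def ka_ks(seq1, seq2):
    """Return (KA, KS) for two in-frame, aligned coding sequences."""
    syn_s = non_s = syn_d = non_d = 0.0
    for i in range(0, min(len(seq1), len(seq2)) - 2, 3):
        c1, c2 = seq1[i:i + 3].upper(), seq2[i:i + 3].upper()
        if "*" in (CODON_TABLE.get(c1, "*"), CODON_TABLE.get(c2, "*")):
            continue  # skip gaps, ambiguous bases, and stop codons
        s = (syn_sites(c1) + syn_sites(c2)) / 2
        syn_s += s
        non_s += 3 - s
        sd, nd = count_diffs(c1, c2)
        syn_d += sd
        non_d += nd
    if syn_s == 0 or non_s == 0:
        raise ValueError("no comparable codons with synonymous sites")
    return jukes_cantor(non_d / non_s), jukes_cantor(syn_d / syn_s)
```

A caller would typically compare a candidate copy with its closest functional paralog and inspect `ka / ks` when `ks` is greater than zero. A ratio close to 1 is what neutral evolution predicts, whereas a ratio well below 1 suggests selective constraint; deciding between the two in practice requires a statistical test and enough substitutions, which this sketch does not provide.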

All mammalian genomes investigated so far appear to have a high and similar number of detectable pseudogenes (~20 000), suggesting that they share similar mechanisms (and rates) for the formation and death of this type of region (Gibbs et al., 2004; Torrents et al., 2003). On the other hand, other vertebrates, such as chicken, appear to have a nearly undetectable number of processed pseudogenes (ICGSC, 2004), which is likely due to the lack of interaction between the machinery of active retrotransposons and host mRNAs. Similarly, a number of searches within nonvertebrate genomes revealed, in general, a low number of both processed and nonprocessed pseudogenes (Harrison et al., 2003; Harrison et al., 2001; Harrison et al., 2002; Zdobnov et al., 2002), which could be consistent with the observed size constraints associated with their genomes (Petrov and Hartl, 2000).

Between 2001 and 2003, important progress was made in the identification and classification of pseudogenes. Nevertheless, we expect that the sequencing of more genomes, and particularly the increasing availability of new experimental data revealing atypical forms of functionality, will provide, in the near future, additional criteria for the difficult task of distinguishing between functional and pseudogenic gene duplicates. This will then allow significant improvements in the construction of pseudogene catalogs and in the investigation of their actual impact on gene and genome evolution.
