Gene Structure (Molecular Biology)

A gene is widely understood as the fundamental unit of genetic information, but a detailed description of a gene is not straightforward.

1. Intragenic Recombination and Colinearity with Polypeptide Chain

The first step in elucidating the structure of the gene was recognizing that it is subdivisible into linearly arranged, individually mutable sites. Until the 1950s it was assumed that genes are both units of function and indivisible units of genetic transmission. The functional criterion for regarding different mutants as affected in the same gene, that is, as allelic, was that, when brought together in a heterozygote or heterokaryon, they could not make good each other’s deficiencies to produce a nonmutant phenotype. This criterion is still a good general guide, although, as detailed under Interallelic Complementation, there are circumstances in which it fails. The indivisibility criterion meant that two different mutant alleles of the same gene should not be able to recombine to yield a wild-type allele. This assumption was sustainable only as long as no more than moderate numbers of meiotic segregants from heteroallelic diploids were screened.

In the fruit fly Drosophila melanogaster, M. M. Green, working with the lozenge gene, and E. B. Lewis, working with other genes, including white, demonstrated rare crossing over between mutants previously thought to be allelic and dubbed these recombining mutants pseudoallelic (1). Then Pontecorvo and his colleagues found that the sexually reproducing fungus Aspergillus nidulans produces wild-type recombinants from crosses between several pairs of alleles, with a frequency on the order of 1 in 10,000. They challenged the idea of pseudoallelism, thinking it likely that mutual recombinability is a typical feature of mutants that are allelic in the functional sense. This view was strengthened by Benzer’s analysis of hundreds of mutations within the rII gene of bacteriophage T4, in which he showed that very nearly all pairwise combinations of noncomplementing mutants would yield wild-type virus at some frequency from mixed infection of bacterial cells (2). It was rare for different mutations within the same functional gene to fall at exactly the same site and, when at different sites, they would recombine with one another. The same principle was shown to hold in the bacteria Escherichia coli and Salmonella typhimurium, where the method of analysis was mainly transduction, and in fungi and Drosophila, where recombination within genes occurs during meiosis. In the yeast Saccharomyces cerevisiae the frequency of intragenic meiotic recombination is unusually high, up to several per cent of meiotic products.


When recombination between allelic mutants was discovered, it became possible to map sites of mutation within a gene in a linear sequence. There are two general ways of doing this. The first uses flanking markers, that is, gene differences with visible effects that are closely placed on each side of the gene under analysis. If intragenic recombination is due to classical crossing over, the flanking markers occur mainly in one or other of the two "crossed-over" combinations among wild-type interallelic recombination products, depending on which way the mutational sites are placed with respect to the flanking markers. Figure 1 shows an example from the analysis of the Drosophila rosy gene by Chovnick et al. (3).
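The flanking-marker argument can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the parents are taken to be A m1 + B and a + m2 b, the marker and allele names are hypothetical, and the counts in the usage line are invented rather than taken from Fig. 1. A single crossover between the two sites gives a wild-type gene flanked by a and B if m1 lies to the left of m2, and by A and b if the order is reversed.

```python
from collections import Counter

def infer_site_order(wild_type_recombinants):
    """Infer the order of two mutant sites from flanking markers.

    Assumed (hypothetical) parents: 'A m1 + B' and 'a + m2 b'.
    Each item in the input is the flanking-marker combination of one
    wild-type recombinant, written as 'aB', 'Ab', 'AB' or 'ab'.
    """
    counts = Counter(wild_type_recombinants)
    if counts["aB"] > counts["Ab"]:
        return "A - m1 - m2 - B"
    if counts["Ab"] > counts["aB"]:
        return "A - m2 - m1 - B"
    # Equal numbers, or only parental combinations (conversion): no decision.
    return "undetermined"

# Invented counts: most recombinants fall in the 'aB' crossover class.
sample = ["aB"] * 18 + ["Ab"] * 1 + ["AB"] * 4 + ["ab"] * 3
print(infer_site_order(sample))
```

Wild-type products that keep a parental marker combination (AB or ab here) carry no information about site order in this scheme; they are the class attributed to conversion without crossing over.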

Figure 1. Recombination within the rosy (ry) gene of Drosophila, which encodes xanthine dehydrogenase (XDH). Recombinant eggs from ry5/ry41 (XDH-) females were automatically selected on purine-containing food. The constitutions of the eggs with respect to mutations in closely placed flanking genes were established by further test-crosses. The ry+ eggs not recombined for the flanking markers are presumed to be due to conversion without crossing over. Recombination between sites 5 and 41 is strongly associated with one flanking-marker crossover class.

The second method, which gives clear results even when the recombination within the gene is due mainly to gene conversion without crossing over, relies on the principle of overlapping deletions. In any very extensive collection of allelic mutations, it is likely that some are due to deletion of segments of the gene, each of which overlaps several mutational sites and partly overlaps other deletions. No point mutation can yield wild-type recombinants with a deletion that removes the corresponding nonmutant site, and neither do different deletions yield wild-type recombinants if they overlap. By crossing a suitable set of deletion mutants with each other and scoring the progeny for the presence versus absence of wild-type recombinants, one establishes the overlaps between the deletions, and then, by crossing the whole set to point mutations, localizes the mutational sites to one or other of the segments defined by the deletion overlaps. The method was first used by Benzer in his analysis of the T4 rII genes. An example from yeast, the CYC1 gene, which encodes the protein of the respiratory pigment cytochrome c, is explained in Fig. 2 (4).
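The logic of deletion mapping can be sketched as follows. The deletion names, map coordinates, and test results are invented for illustration and are not data from Benzer or from Fig. 2. A point mutation is assigned to the map segment covered by exactly those deletions with which it fails to give wild-type recombinants.

```python
# Hypothetical deletions, each removing a half-open interval [start, end)
# of an arbitrary gene map.
DELETIONS = {"del_A": (0, 6), "del_B": (4, 10), "del_C": (8, 12)}

def deletions_overlap(d1, d2, deletions=DELETIONS):
    """Overlapping deletions are expected to give no wild-type recombinants."""
    (s1, e1), (s2, e2) = deletions[d1], deletions[d2]
    return s1 < e2 and s2 < e1

def segments(deletions=DELETIONS):
    """Split the map at every deletion endpoint and label each segment
    with the set of deletions that cover it."""
    cuts = sorted({p for interval in deletions.values() for p in interval})
    result = []
    for left, right in zip(cuts, cuts[1:]):
        covering = frozenset(name for name, (s, e) in deletions.items()
                             if s <= left and right <= e)
        result.append(((left, right), covering))
    return result

def localize(no_recombinants_with, deletions=DELETIONS):
    """Return the segment(s) covered by exactly the deletions that gave
    NO wild-type recombinants when crossed to the point mutant."""
    target = frozenset(no_recombinants_with)
    return [seg for seg, covering in segments(deletions) if covering == target]

# A point mutant that fails to recombine with del_A and del_B, but does
# recombine with del_C, must lie in the region the first two share.
print(localize({"del_A", "del_B"}))
```

Note that the method orders sites only to the resolution of the deletion endpoints: two point mutations falling in the same segment are not ordered relative to each other.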

Figure 2. Establishment of the sequence of mutant sites within the Saccharomyces cerevisiae CYC1 gene that encodes cytochrome c. A set of overlapping deletion mutants were each tested for recombination with point mutants: + or - indicates whether or not wild-type recombinants were formed. The absence of recombinants means that the site of the point mutation falls within the segment deleted in the other parent. The deletion overlaps define the sequence of the mutant sites. The amino acid codon affected by each point mutation was determined biochemically. The sequence of codons corresponds to the sequence of mutant sites.

From this and many other examples, the gene emerged as a very finely subdivisible unit. Once the genetic material was recognized to be DNA, the concept emerged, and still holds, that every base pair of the DNA sequence is a potential site for mutation, separable from its neighboring nucleotides at some low frequency by recombination. Taken together with the idea that each gene is responsible for a single enzymic protein, which had emerged from the early Neurospora work, it was an obvious guess that the linear sequence of mutable and recombinable sites within the gene corresponds to the linear amino acid sequence of the polypeptide chain of the protein product.

Drosophila rosy (ry) and Saccharomyces CYC1 are particularly appropriate examples because their mutations affect well-characterized proteins, namely, the enzyme xanthine dehydrogenase in the first case and the protein of the respiratory pigment cytochrome c in the second. In both cases point mutations cause mostly single amino acid replacements in the polypeptide chain, and the sequence of the replacements along the polypeptide chain corresponds to the sequence of mutational sites along the gene map. The comparison is particularly extensive for yeast CYC1 (Fig. 2). Numerous other examples could be given, and the principle, called colinearity, is so well established now in molecular biology that further substantiation through genetic mapping is hardly necessary. Genes, or at least those genes responsible for protein structure, are linear codes for the amino acid sequences of polypeptide chains, with exceptions that are mentioned later. Parallel with examples of colinearity came the molecular identification of the gene as DNA, the demonstration of transcription of genic DNA into RNA, and the working out of the genetic code through which messenger RNA (mRNA) is translated into polypeptide sequence.

The original "one gene – one enzyme" concept was incomplete in several ways. The immediate products of genes are RNA molecules, not all of which are messengers for protein synthesis (e.g., ribosomal RNA, small nuclear RNA, and transfer RNA), and not all proteins are enzymes. Among the enzymes, many are multifunctional, and one function can often be lost by mutation without eliminating other functions of the same polypeptide chain. As a result, analysis by mutation, complementation, and recombination may subdivide a gene into several functionally distinct domains, which may at first be confused with separate genes. A multifunctional, multidomain gene is sometimes called a cluster gene, to distinguish it from a gene cluster. Most of the known examples are in fungi, but they occur in Drosophila (e.g., Fig. 3) and also in mammals. This sort of gene complexity is more extensively discussed under Interallelic Complementation.

Figure 3. The Gart gene of Drosophila, which exhibits several kinds of gene complexity: (1) an intron-exon structure (common to most Drosophila genes; introns are shown as open and exons as black segments); (2) the full-length mRNA encodes a trifunctional polypeptide that has three different enzymatic activities, abbreviated to GARS, AIRS, and GART, that catalyze successive steps in purine biosynthesis; (3) it has two polyadenylation/transcript termination sites, T1 and T2, one within the fourth intron, resulting in a truncated mRNA that encodes only the functionally independent GARS domain; (4) it has a "nested" gene within its longest intron that encodes a cuticle protein and itself has an intron.

2. Transcribed and Nontranscribed DNA Strands and the Minimal Gene

In writing about gene structure, it is useful to use the terms "upstream" and "downstream" to indicate orientation with respect to the direction of transcription of DNA into RNA. Only one strand of the DNA duplex is usually transcribed, and it is important to remember that the transcribed strand of a protein-encoding gene is not the coding strand, in the sense of carrying the codon sequence as seen in the messenger RNA, but rather the opposite-polarity complement of that sequence. The transcribed strand runs upstream-to-downstream in the chemical direction 3′ to 5′, whereas the mRNA and its amino acid codons run 5′ to 3′. To deduce the amino acid sequence from the transcribed strand using the standard codon tables (see Genetic Code), one must mentally convert it into its complement: A to U, T to A, G to C, and C to G. The amino acid sequence can be read directly from the nontranscribed strand (with T in place of U), which is therefore often called the coding strand. It is best to distinguish the strands as transcribed or nontranscribed, bearing in mind that the sequence of the latter bears the immediately recognizable codons.
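The relation between the two strands and the mRNA can be made concrete with a short Python sketch. The twelve-base sequence is invented for illustration, and the function names are hypothetical; the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons ordered with bases U, C, A, G
# (third position varying fastest); '*' marks a termination codon.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Complement rule quoted in the text: A to U, T to A, G to C, C to G.
PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_3to5):
    """Transcribed strand written 3'->5' (upstream to downstream)
    gives the mRNA written 5'->3'."""
    return "".join(PAIR[base] for base in template_3to5)

def coding_to_mrna(coding_5to3):
    """Nontranscribed (coding) strand reads like the mRNA with T for U."""
    return coding_5to3.replace("T", "U")

def translate(mrna):
    """Translate from the first AUG to the first termination codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

template = "TACCGAACCATT"   # invented transcribed strand, 3'->5'
coding = "ATGGCTTGGTAA"     # its nontranscribed partner, 5'->3'
mrna = transcribe(template)
print(mrna, translate(mrna))
```

Both strands lead to the same mRNA, but only the nontranscribed strand shows the codons directly, which is the point made in the paragraph above.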

Once defined as a unit of transcription into RNA, the gene may be a relatively compact structure, at least when measured against more than 10^9 DNA base pairs in a typical mammalian genome or the even greater DNA contents (C-values) of some amphibia and higher plants. We can enumerate the essential components of a protein-encoding gene as follows. There must be a promoter segment to which RNA polymerase binds to initiate transcription a little way upstream of the transcription startpoint (with the exception that RNA polymerase III works with downstream promoters), and a transcription termination and polyadenylation signal. Within the transcribed region there are sequences corresponding to the different sections of the messenger transcript: an untranslated leader sequence (of various lengths in different genes and sometimes important for controlling translation), a ribosome binding site and initiation codon, an open reading frame, and a termination codon. The yeast CYC1 gene (Fig. 2) is a good example of a simple gene.

3. Expansion of the Gene Beyond the Coding Sequence

To be transcribed into mRNA for polypeptide chains or other kinds of RNA molecule, genes need be no more than a few thousand base pairs (kilobase pairs or kbp) long. How, then, is one to account for the apparently huge surplus of DNA in the genomes of many higher organisms? Humans, for example, have about 30 kbp of DNA per gene if the current estimate of about 100,000 genes is correct. Part of the answer lies in the host of repetitive and arguably functionless sequences that have become established. But probably equally important quantitatively are the intervening sequences (introns) that subdivide the coding sequences into fragments (exons); after the gene has been transcribed, the introns are spliced out and the exons are spliced together. The sizes and numbers of introns vary widely among different organisms. They are hardly present at all in bacteria; in yeast they occur in only a few genes and are then seldom more than a hundred bases long; they are more numerous but still short in filamentous fungi, sometimes much longer in Drosophila, and often both very numerous and extremely long in mammals. The typical mammalian gene consists of rather short exons, each on the order of only hundreds or even tens of base pairs, separated by introns extending up to tens of kilobases long. Thus a gene, whose function is to encode a polypeptide chain of just a few hundred amino acid residues, may be spread over a tract of some 100,000 base pairs.
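The arithmetic behind these figures, and the effect of splicing, can be sketched in Python. The genome size of 3 × 10^9 base pairs is an assumption consistent with the "about 30 kb per gene" quoted above; the toy transcript, its exon coordinates, and the function names are invented for illustration.

```python
def dna_per_gene_kb(genome_bp, gene_count):
    """Average amount of DNA per gene, in kilobase pairs."""
    return genome_bp / gene_count / 1000

def splice(transcript, exons):
    """Join the exons of a primary transcript, discarding the introns.
    `exons` is an ordered list of half-open (start, end) positions."""
    return "".join(transcript[start:end] for start, end in exons)

def exon_fraction(gene_length_bp, exons):
    """Fraction of the gene tract that survives splicing."""
    return sum(end - start for start, end in exons) / gene_length_bp

# Assumed genome size of 3e9 bp and the gene count quoted in the text.
print(dna_per_gene_kb(3_000_000_000, 100_000))

# Toy transcript: exons in upper case, introns in lower case.
primary = "AAAgtcccagGGGgtttagTTT"
print(splice(primary, [(0, 3), (10, 13), (19, 22)]))

# A 300-residue polypeptide needs 900 coding bases; spread over a
# 100,000-bp gene, that is under 1% of the tract.
print(exon_fraction(100_000, [(0, 900)]))
```

The last line ignores untranslated leader and trailer sequences, which also lie in exons, so it slightly understates the exon fraction of a real gene.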

In addition to introns, other sequences, which may fall some distance outside the units of transcription, have some claim to be considered parts of the gene and to expand its domain. Although enhancers may occur within introns, they are probably more often outside the transcribed gene sequence, often upstream but sometimes downstream. The same applies to silencers. Insofar as enhancer sequences are cis-acting, which they usually are, they may be considered as falling within the gene boundaries. If so, this may somewhat complicate the idea of the gene as a discrete functional unit if the same enhancer services two or more different transcriptional units. Thus, a single locus control region, an approximately 20-kbp sequence, not usually called an enhancer but at least some elements of which function as such, acts on all of the genes of the human β-globin gene cluster. Genes are also functionally linked through chromatin structure, changes in which repress or release the transcription of blocks of genes (see also Epigenetics, Position Effect).

4. Programmed Gene Restructuring

Although most genes in most organisms have constant DNA sequences, the same in the cells where they are transcribed as in the germ cells through which they are transmitted, there are examples of genes in a wide range of organisms that are restructured in the process of cellular differentiation. In unicellular organisms, bacteria, yeasts, and protozoa, where any cell can found a new population, gene restructuring is always reversible within the cell. Thus the switching of flagellar antigen in the bacterium Salmonella typhimurium is brought about by the inversion of a DNA segment, which "switches off" one antigen-encoding gene by separating it from its promoter and thereby switches on a second antigen gene, whose repressor is cotranscribed with the first (7). The inversion is reversible. In Saccharomyces yeast, the well-known mating type switch is due to replacing a segment of DNA at a transcriptionally activating site with a segment copied from another locus, where it had been transcriptionally silent. The potential for both mating types is held at silent "cassette" loci at all times, but only one mating type is expressed from the segment present at the activating (or, more correctly, nonsilencing) locus. Similar mating-type switching occurs in the fission yeast, Schizosaccharomyces pombe. A system not dissimilar in principle but with more options, which again involves the transfer of gene sequences from silent to expressed loci, operates to switch the major surface antigen of the pathogenic protozoan Trypanosoma brucei (8).

In all the cases just mentioned, the potential for switching back to the status quo ante is retained within the cell nucleus. The situation is very different in ciliated protozoa, such as Paramecium, Tetrahymena, and Oxytricha, whose cells contain two kinds of nuclei: a virtually "silent" micronucleus, which contains the basic genetic material and has the potential for differentiation in different directions (e.g., to different mating types or different surface antigens), and an active macronucleus, in which a selection of genes is amplified and restructured to support a particular cell type. At meiosis, which generates haploid nuclei for sexual cross-fertilization or for a kind of self-fertilization called autogamy, the macronuclei degenerate and disappear and are replaced by division and restructuring of the micronuclei, with the opportunity for switching cell type. Some of the gene rearrangement seems quite bizarre. For example, in Oxytricha trifallax, a gene that encodes an actin protein, whose exons are labeled 1 to 10 upstream-to-downstream in the macronucleus, is reassembled from the order 3, 4, 6, 5, 7, 9, 10, 2, 1, 8 found in the micronucleus (9).

Gene restructuring in mammals is the exception rather than the rule, but is centrally important in the immune system. The functional genes for the virtually infinitely large number of different immunoglobulins and T-cell receptor proteins are pieced together in a large number of possible permutations from DNA segments that are separated by hundreds of kilobases in undifferentiated stem cells. Note that the nucleic acid splicing involved here occurs at the level of DNA, not RNA as in splicing-out of introns. This topic is dealt with in more detail under Complex Loci, Immunoglobulins, and T-cell receptor.

A different category of incomplete genes comprises those made functional not by rearrangement at the DNA level but by splicing their RNA transcripts to leader sequences provided by what might be called supplementary genes elsewhere in the genome. Many or most messenger RNAs in the trypanosomes acquire leader sequences and cap sites in this way (10). Something similar occurs in nematodes, and such systems may be widespread in the less well explored "lower" eukaryotes.
