STRUCTURE AND FUNCTION OF NUCLEIC ACIDS

A. Basic Chemical Structure

The basic information for all activities in living systems, at least on our planet, is stored ultimately in nucleic acids, namely, deoxyribonucleic (DNA) and ribonucleic (RNA) acids. Except for certain viruses, DNA is the universal genetic material (Fig. 1). The chemical structures of the basic units of RNA and DNA have been elucidated, and both types of nucleic acids are linear polymers of monomeric units called nucleotides. A nucleotide consists of a purine or pyrimidine base linked to C-1′ of a pentose (furanose) via an N-glycosyl bond and contains a phosphate residue attached to the sugar via an ester bond with the CH2OH group at the 5′ position. The linear polymer in both RNA and DNA is generated by an ester linkage between the C-3′ hydroxyl of one 5′ nucleotide and the 5′ phosphate of the next, producing a 3′-5′ phosphodiester linkage (Fig. 1B).

There are several differences in the chemical structures of DNA and RNA. First is the nature of the pentose ring in these macromolecules, i.e., ribofuranose for RNA and 2′-deoxyribofuranose for DNA (Fig. 1A). Because of the presence of deoxyribose in DNA, the monomeric unit is called a deoxyribonucleotide or simply a deoxynucleotide, while the RNA monomer unit is called a ribonucleotide.

FIGURE 1 Structure of DNA and RNA: (A) structure of deoxyribonucleotides and ribonucleotides and (B) structure of a polynucleotide. Each 3′ carbon of the sugar residue is linked to the 5′ carbon of the sugar residue in the next nucleotide with a phosphate to form the phosphodiester backbone. (C) Base pairing of adenine with thymine (uracil) and guanine with cytosine. Dotted lines denote hydrogen bonding between two bases. R, pentose ring of nucleotide. (D) A three-dimensional structure of a DNA helix.

The term "nucleotide" is used generically for both RNA and DNA units. The absence of a 2′-OH group in DNA prevents the alkali-mediated cleavage of the 3′-5′ phosphodiester bond observed in RNA and thus makes DNA more resistant to hydrolysis. Both RNA and DNA contain two types of purines, adenine (A) and guanine (G), and two types of pyrimidine bases (Fig. 1C). The second key difference between RNA and DNA is that while cytosine (C) is present in both RNA and DNA, RNA normally contains uracil (U), while DNA contains 5-methyluracil, called thymine (T), as the other pyrimidine base. The difference in chemical structure is reflected in the intrinsic chemical stability of these nucleic acids. The purine N-glycosyl bond in DNA is more unstable than in RNA, and as a result, purines are released much more easily from DNA by acid catalysis. Furthermore, cytosine deamination to produce U also occurs at a finite rate in DNA. Various processes have evolved to maintain genomic integrity in the face of such damage, as discussed later.

Finally, two other critical differences between DNA and RNA are in the length and structure of the polymer chains. DNA polymers, as elaborated later, usually exist as a helix consisting of two intertwining chains, while RNA is present mostly as a single chain. Furthermore, DNA can contain up to several billion deoxynucleotide monomeric units in the genomes of higher organisms, although the genomes of smaller self-replicating units such as viruses contain only a few thousand deoxynucleotides. In contrast, RNA chains are typically no more than a few thousand nucleotides long.

B. Base Pairing in Nucleic Acids: Double Helical Structure of DNA

The most important discovery in molecular biology was the identification of the right-handed double helical structure of DNA, where two linear chains are held together by base pair complementarity. This discovery by Watson and Crick in 1953 heralded the era of molecular biology, which was preceded by the rapid accumulation of genetic evidence indicating that DNA, as the genetic material of all organisms, is the primary storehouse of all their information. Exceptions to this fundamental principle were found in certain bacterial, plant, and mammalian viruses, in which RNA constitutes the genome. However, the viruses are obligate parasites and are not able to self-propagate as independent species; thus, they have to depend on their hosts, which have DNA as their genetic material. Thus, DNA in all genomes (except some single-stranded DNA viruses) consists of two strands of polydeoxynucleotides which are antiparallel with respect to the orientation of the 5′-3′ phosphodiester bond in the polymers (Fig. 1D). The two strands are held together by H-bonding between a purine in one strand and a pyrimidine in the complementary strand. Normally, adenine (A) pairs with T and G pairs with C; A and T are held together by two H-bonds, and G and C are held together by three H-bonds involving both exocyclic C=O and ring NH (Fig. 1C). As a result, G·C pairs are more stable than A·T pairs. Because U is structurally nearly identical to T, except for the C-5 methyl group, U also pairs with A in the common configuration. Although H-bonds are inherently weak, the stacking of bases in two polynucleotide chains makes the duplex structure of DNA quite stable and induces a fibrillar nature in the DNA polymer. X-ray diffraction studies of the DNA fiber, and subsequent crystallographic studies of small (oligonucleotide) DNA pieces, led to the detailed structural elucidation.

This was initially aided by chemical analysis showing equivalence of purines and pyrimidines in all double-stranded DNA and equimolar amounts of A and T and of G and C (Chargaff's rule), unlike in RNA, which is single stranded (except in some viruses). X-ray diffraction studies also showed that DNA in the double helix exists in the B-form, which is right handed and has a wide major groove and a narrow minor groove. Most of the reactive sites in the bases, including C=O and NH groups, are exposed in the major groove (Figs. 1C and 1D). One turn of the helix has 10 base pairs (bp) with a rise of 34 Å. Thus, each pair is rotated 36° relative to its neighbor and displaced 3.4 Å along the helix axis. Elucidation of the structure of DNA bound to proteins shows that one turn of the helix containing 10.5 bp could be significantly bent or distorted. For example, some DNA binding proteins bind to the minor groove, causing its widening accompanied by compression of the major groove. In some special regions of the genomes, e.g., in telomeres and segments with unusual repeated sequences, alternative forms such as triple helical structure and Z-DNA may exist. The Z-DNA has a left-handed, double-helical structure. In these or in torsionally stressed DNA, the bases can be held together by a different type of H-bonding called Hoogsteen base pairing.
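The pairing rules, Chargaff's rule, and the B-form geometry described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the 10-nucleotide sequence and all function names are hypothetical, and the geometry uses the idealized values quoted in the text (10 bp and 34 Å per turn).

```python
from collections import Counter

# Watson-Crick pairing rules from the text: A pairs with T, G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}
# Hydrogen bonds contributed by each base pair: two for A-T, three for G-C.
H_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}

def reverse_complement(strand):
    """Return the antiparallel partner strand, written 5'->3'."""
    return "".join(PAIR[base] for base in reversed(strand))

def duplex_composition(strand):
    """Count each base over both strands of the duplex formed by `strand`."""
    return Counter(strand + reverse_complement(strand))

def total_h_bonds(strand):
    """Sum the hydrogen bonds holding the duplex together."""
    return sum(H_BONDS[base] for base in strand)

# Idealized B-form geometry quoted in the text.
BP_PER_TURN = 10
PITCH_ANGSTROM = 34.0
rise_per_bp = PITCH_ANGSTROM / BP_PER_TURN   # axial displacement per base pair
twist_per_bp = 360.0 / BP_PER_TURN           # rotation per base pair, in degrees

top = "ATGCGGATTC"                 # hypothetical sequence, 5'->3'
bottom = reverse_complement(top)   # partner strand, also 5'->3'
counts = duplex_composition(top)

print(bottom)
print(dict(counts))
print(total_h_bonds(top), rise_per_bp, twist_per_bp)
```

In the duplex, the counts of A and T are equal, as are those of G and C, whatever single-strand sequence is chosen; that equality is Chargaff's rule and does not hold for one strand taken alone.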

C. Size, Structure, Organization, and Complexity of Genomes

Except for certain viruses, DNA is the genetic material for all organisms and self-replicating units, including viruses and such intracellular organelles as chloroplasts (in plants), kinetoplasts (in protozoa), and mitochondria (in most eukaryotes). Genomic DNA is double helical (except for the genomes of certain bacterial viruses), and its size is related to the complexity of the organism (Table I). In subcellular organelles, viruses, and plasmids, the genome often exists as a circular molecule consisting of up to several thousand base pairs. The genome of bacteria, such as that of the widely studied enteric strain E. coli, is present as a single, circular, double-stranded molecule containing about 4.6 million base pairs (Table I). By and large, the genome of many small self-replicating entities is circular DNA, without any terminus in the unbranched polymeric chain.

In contrast, the large nuclear genomes of more complex organisms (from lower eukaryotes such as unicellular yeast with a genome size only an order of magnitude larger than that of E. coli, to mammals with genomes larger by three orders of magnitude) consist of multiple, distinct, linear subunits organized in chromosomes. Depending on the stage of the cell cycle, the structure of chromosomes (collectively called chromatin) varies from the highly extended and amorphous state occurring in much of the (interphase) nucleus to highly compacted, linear, organized chromosomes (metaphase) after completion of DNA duplication followed by cell division (mitosis). This complex organization of eukaryotic genomes is a distinctive feature which separates them from the prokaryotes.

D. Information Storage, Processing, and Transfer

The central dogma of molecular biology is that information is transferred from DNA to RNA to proteins. The proteins (which include the enzymes and structural components of cells) are directly responsible for most cellular activities and functions. The information needed for all functions of all organisms is stored in the genomic DNA sequence, which contains discrete units defined as genes. Each gene encodes a protein whose function and activity are determined by its primary sequence. The discovery of colinearity of the DNA nucleotide sequence and the amino acid sequence of the encoded polypeptide in prokaryotes and their viruses led to the discovery of the genetic code which postulates that a three-nucleotide sequence in DNA, called a codon, is responsible for insertion of a specific amino acid in the polypeptide chain during its synthesis.

Thus, the information content in the genomic DNA of a cell needs not only to be preserved and passed on to the progeny cells during replication, an essential characteristic and requirement of all living organisms, but also has to be processed and transferred via proteins to the ultimate cellular activities, including the metabolism.

The double-helical structure of DNA lends itself to a simple but elegant mechanism for perpetuating the DNA information during duplication, called semi-conservative replication. In this model (Fig. 2), the two strands of DNA separate, and each then acts as the template for synthesis of a new daughter strand based on base pair complementarity and strand polarity. Thus, the two strands of the DNA double helix, though not identical in sequence, are equivalent in information content.
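The logic of semi-conservative replication can be sketched in a few lines of Python. This is a toy model under stated assumptions: the sequence and function names are hypothetical, each strand is a string written 5′ to 3′, and the enzymology (polymerase, primers, leading and lagging strands) is ignored entirely.

```python
# Watson-Crick pairing rules: A pairs with T, G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def synthesize_daughter(template):
    """Build the new strand specified by a template strand.

    Both the template and the returned strand are written 5'->3', so the
    complement is read off the template in reverse (antiparallel) order.
    """
    return "".join(PAIR[base] for base in reversed(template))

def replicate(duplex):
    """Separate the two parental strands and copy each one.

    `duplex` is a (top, bottom) tuple. Each returned daughter duplex keeps
    one parental strand and gains one newly synthesized strand.
    """
    top, bottom = duplex
    daughter_1 = (top, synthesize_daughter(top))        # old top, new bottom
    daughter_2 = (synthesize_daughter(bottom), bottom)  # new top, old bottom
    return daughter_1, daughter_2

parent = ("ATGCGGATTC", "GAATCCGCAT")  # hypothetical 10-bp duplex
daughter_1, daughter_2 = replicate(parent)
print(daughter_1)
print(daughter_2)
```

Both daughters come out identical to the parent even though each was rebuilt from a different single strand, which is what it means for the two strands to be equivalent in information content.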

TABLE I Genomic DNA Characterized in Biology*

Organism	Structure	Total size (bp)	Number of genes	Sequence
Bacteriophage	Linear, circular	5–200 × 10³	10–100	Completed for many species
Virus		Up to 2 × 10⁵	10–100	Completed for many species
Bacteria (E. coli)	Circular	4.6 × 10⁶	~4300	Completed
Eukaryote
Yeast (S. cerevisiae)	Linear	1.4 × 10⁷	~6000	Completed
Drosophila	Linear	1.4 × 10⁸	1.4 × 10⁴	Partially completed
Human	Linear	3 × 10⁹	4 × 10⁴ to 1 × 10⁵	Partially completed

FIGURE 2 DNA polymerization reaction. (A) According to the base pairing rules, a deoxythymidine triphosphate (dTTP) is added at the 3′-OH end of the top strand through a transesterification reaction catalyzed by a DNA polymerase. (B) Two units of DNA polymerase form a heterodimer complex to carry out replication in a semi-conservative way. Because the reaction goes only in the 5′→3′ direction, one side (the leading strand) is synthesized continuously, while the other (the lagging strand) consists of short DNA fragments (Okazaki fragments). DNA replication is initiated by an RNA primer (waved line) which is synthesized by a primase. There are a number of accessory but essential proteins besides the polymerase unit.

FIGURE 3 An RNA polymerase unit (filled circle), which consists of multiple factors, opens the DNA helix (shown as a bubble) and synthesizes RNA in the 5′→3′ direction.

The intermediate carrier in the transfer of information from DNA to protein is the messenger RNA (mRNA), which is copied (transcribed) from only one of the two strands (Fig. 3), based on base pair complementarity (except for the presence of U in RNA in the place of T; Fig. 1C). In the synthesis of both DNA (replication) and RNA (transcription), the polynucleotide chains are synthesized by sequential addition of monomeric units (deoxyribonucleotides for DNA and ribonucleotides for RNA) to the 3′ end of the growing chain (Fig. 3).
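Transcription from a single template strand can be illustrated with a minimal Python sketch. The sequence and function name are hypothetical, and the model assumes only the rules stated above: base pair complementarity, U in place of T, and an RNA product antiparallel to its template.

```python
# Pairing between a DNA template base and the RNA base inserted opposite it.
# The only difference from DNA-DNA pairing is that A in the template
# specifies U, not T, in the RNA.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_strand):
    """Return the mRNA (5'->3') copied from a DNA template strand (5'->3').

    The RNA is antiparallel to its template, so the template is read from
    its 3' end while the RNA chain grows at its own 3' end.
    """
    return "".join(DNA_TO_RNA[base] for base in reversed(template_strand))

template = "GAATCCGCAT"   # hypothetical template strand, 5'->3'
coding = "ATGCGGATTC"     # its partner (nontemplate) strand, 5'->3'
mrna = transcribe(template)
print(mrna)
```

The transcript has the same sequence as the nontemplate strand with every T replaced by U, which is why only one of the two strands needs to be copied.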

The mRNA is read out by ribosomes, the ribonucleoprotein complex which functions as the factory for protein synthesis. The codons are recognized as blocks because they code for specific amino acids. Thus, the linear polypeptide sequence is determined by the linear mRNA sequence.
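Reading an mRNA in blocks of three can be sketched in Python as follows. The sketch assumes the standard genetic code, which the text does not list, and represents amino acids by their one-letter symbols with "*" for a stop codon. The mRNA sequence and function name are hypothetical, and reading simply begins at the first nucleotide instead of searching for a start codon.

```python
from itertools import product

# Standard genetic code (an assumption; the text does not give the table).
# Codons are enumerated with bases in the order U, C, A, G at each position,
# and the string lists the amino acid for each codon in that order.
BASES = "UCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"   # codons beginning with U
               "LLLLPPPPHHQQRRRR"   # codons beginning with C
               "IIIMTTTTNNKKSSRR"   # codons beginning with A
               "VVVVAAAADDEEGGGG")  # codons beginning with G
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(mrna):
    """Read an mRNA 5'->3' in successive three-nucleotide codons.

    Stops at the first stop codon; an incomplete trailing codon is ignored.
    """
    polypeptide = []
    for start in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[start:start + 3]]
        if amino_acid == "*":
            break
        polypeptide.append(amino_acid)
    return "".join(polypeptide)

mrna = "AUGCGGAUUCUGUAA"   # hypothetical message: AUG CGG AUU CUG UAA
print(translate(mrna))
```

Because each codon maps to one amino acid and codons are read without overlap, the order of amino acids in the polypeptide follows directly from the order of codons in the mRNA.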

E. Chromosomal DNA Compaction and Its Implications in Replication and Transcription

Metaphase chromosomes in cells undergoing mitosis are visible under the light microscope. Their formation requires some 10⁴- to 10⁵-fold condensation of uninterrupted linear duplex DNA which has a 2-nm diameter. Such compaction is accomplished in a highly complex and stepwise fashion. Because DNA is a polyelectrolyte with one negative charge per nucleotide (two per base pair), charge neutralization and shielding are required before the polymer can be folded in an ordered, condensed structure. In addition to metal ions and polyamines, the major source of the positive charge in chromatin is the family of highly basic small proteins, called histones, which are rich in the basic amino acid residues lysine and arginine needed to neutralize the charge of the phosphate backbone of DNA. The prokaryotes also have basic proteins (such as HU protein in E. coli) which induce DNA condensation. However, chromatin compaction in eukaryotes is carried out in stages. The simplest folded unit of DNA is the 10-nm nucleosome, consisting of a core histone octamer containing two molecules each of histone H2A, H2B, H3, and H4 around which nearly two turns of the DNA are wrapped. The nucleosome cores are connected by a stretch of linear DNA (linker) of variable length which is covered by histone H1 or H5. The polymeric chain of nucleosomes is then folded in a 30-nm fiber whose structure is stabilized by the interaction among histones and a number of other proteins collectively called nonhistone chromosomal proteins (NHC), including high mobility group (HMG) proteins, which are not particularly basic. Eventually, the fibers are condensed into highly compacted metaphase chromosomes. The nature of the interactions present in interphase and metaphase chromosomes is not clear.

However, the implications of this compaction are profound. It is absolutely essential to condense the mammalian genome, which in an extended linear form is more than 1 m long, to a volume which can be accommodated in the nuclear volume of 10-30 femtoliters. At the same time, the genes will be buried in condensed chromatin, and yet their specific sequences need to be exposed for various processes of information transfer. Thus, for both transcription and replication, the chromatin has to be decondensed. This was evident in early in vitro studies which showed that both these processes are severely inhibited when DNA is complexed with histones.
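The figure of more than 1 m can be checked with a short back-of-the-envelope calculation in Python. It assumes idealized B-form DNA (34 Å per 10-bp turn, i.e., 0.34 nm per base pair), the genome size of 3 × 10⁹ bp from Table I, and, for the volume estimate, a solid cylinder of 2-nm diameter; the variable names are hypothetical, and the numbers are rough estimates only.

```python
import math

RISE_PER_BP_NM = 0.34      # B-form DNA: 34 angstroms per 10-bp turn
DIAMETER_NM = 2.0          # duplex diameter quoted in the text
HUMAN_GENOME_BP = 3e9      # genome size from Table I

# Contour length of the fully extended duplex, converted from nm to m.
length_m = HUMAN_GENOME_BP * RISE_PER_BP_NM * 1e-9

# Volume of the duplex modeled as a solid cylinder (an assumption).
radius_m = (DIAMETER_NM / 2) * 1e-9
volume_m3 = math.pi * radius_m ** 2 * length_m
volume_fl = volume_m3 * 1e18   # 1 cubic meter = 1e18 femtoliters

print(f"extended length: {length_m:.2f} m")
print(f"cylinder volume: {volume_fl:.1f} fL")
```

Under these assumptions, the DNA of one genome copy alone would fill a few femtoliters, a substantial fraction of a 10-30 femtoliter nucleus, which helps explain why ordered, stepwise compaction is needed.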

F. DNA Sequence and Chromosome Organization

The massive human genome project should achieve its goal of determining the complete sequence of human and mouse genomes in the near future; a "rough draft" has already been obtained. Furthermore, this genome initiative, pursued by both government and private enterprises in the United States and other countries, has already culminated in elucidating the complete sequence of E. coli and other bacteria, as well as yeast, a nematode, and the fruitfly Drosophila melanogaster. Significant progress has been made in elucidating the nucleotide sequences of both human and mouse genomes by using a two-pronged approach. On one hand, the sequences of transcribed regions of the genomes are being deduced from sequences of randomly isolated mRNA segments reverse transcribed into DNAs. At the same time, complete DNA sequences of fragments of whole chromosomes are being directly determined. This has opened up a huge scientific challenge of deciphering the genetic information, identifying unknown genes and their encoded proteins, and the variability of gene sequences with corresponding changes in the protein sequences in individuals. Functional genomics is a newly created discipline which deals with the deterministic prediction of protein functions from the primary sequences. One extension of such analysis is to ascertain the consequences of allelic polymorphisms in the human genome, i.e., minor changes in the sequences of cellular proteins which do not cause an explicit pathological phenotype and yet may affect survival and predisposition to specific diseases in the long term.

G. Repetitive Sequences: Selfish DNA

Even before the precise genome sequences are elucidated, one unique feature of the metazoan DNA sequence has been established from a number of studies. A large fraction (perhaps up to 90% or more) of the total genomic sequence in metazoan cells does not encode any information. Some of these sequences are present as noncoding intervening regions in genes, named "introns," which do not code for proteins. The intron sequences are transcribed but are removed during processing ("splicing") to generate mature mRNA, as discussed later. Many of the other genomic sequences are not even transcribed, and these may often be present as multimeric repeats of shorter units. These repetitive sequences have no known function in the cell, yet are maintained and replicated as an integrated part of the genome; such DNA is referred to as "selfish DNA."

Metaphase chromosomes are organized in substructures distinguished by their staining with dyes. Euchromatin regions contain transcribed sequences, while heterochromatin regions contain large segments of repetitive sequences. Metaphase chromosomes are also characterized by specific stained sequences (named centromeres) in the middle of the elongated structure, in addition to telomeres at the termini, as discussed earlier. Both centromeres and telomeres have unique repetitive sequences, and in some cases similar sequences have been observed in other regions of chromosomes; these regions are highly condensed and not transcribed.

H. Chromatin Remodeling and Histone Acetylation

In order to make the DNA template available for both replication and transcription, the chromatin is "remodeled." One way to accomplish this reversible process is by altering the electrostatic interaction of DNA with histones. Acetylation of lysine residues (and to some extent phosphorylation of serine and threonine residues) reduces the binding affinity of histones for DNA in nucleosome cores and may thus allow exposure of free DNA to the transcriptional machinery. Additionally, a more complex energy-driven process involving the SWI/SNF family of proteins causes a major alteration of the chromatin structure, which is necessary for reprogramming of the transcriptional regimen during growth, development, and associated differentiation. DNA replication also requires access of DNA in free form to the replication machinery and, therefore, may also be dependent on the same remodeling process and could even require dissociation and reassociation of the nucleosome core.