Peptide Mass Fingerprint To Rotamer (Genetics,Genomics,Proteomics,Bioinformatics)

Peptide Mass Fingerprint
Analysis of a protein by mass spectrometry produces a series of masses of the peptides that were generated from the original protein by protease cleavage. Knowing the mass series and the protease used, it is often possible to identify the protein. The mass series is referred to as a peptide mass fingerprint (or mass fingerprint) for the protein concerned. Mass fingerprinting cannot be perfectly reliable because of, for example, the existence of isobaric residues (such as leucine and isoleucine, which have the same mass): sequencing at least parts of the fragments is often needed to fully identify the protein.
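As a rough illustration of the idea, the sketch below digests a toy sequence and lists the peptide masses. It assumes trypsin as the protease with a simplified rule (cleave after every lysine or arginine, ignoring the proline exception and missed cleavages), and it uses standard monoisotopic residue masses. The function names and the two toy sequences are hypothetical.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free N- and C-termini


def tryptic_peptides(protein):
    """Split after every K or R (simplified trypsin rule, an assumption)."""
    peptides, current = [], ""
    for residue in protein:
        current += residue
        if residue in "KR":
            peptides.append(current)
            current = ""
    if current:  # C-terminal peptide that does not end in K or R
        peptides.append(current)
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER


def mass_fingerprint(protein):
    """Sorted list of peptide masses: the 'fingerprint' of the protein."""
    return sorted(round(peptide_mass(p), 4) for p in tryptic_peptides(protein))


# Two toy sequences that differ only by leucine versus isoleucine.
fp_leu = mass_fingerprint("GALKSTR")
fp_ile = mass_fingerprint("GAIKSTR")
print(fp_leu == fp_ile)  # the two fingerprints cannot be told apart
```

Because leucine and isoleucine are isobaric, the two toy sequences give identical fingerprints, which is the limitation noted in the definition.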

Peptide Sequence Tag

A short string of peptide mass differences corresponding to a peptide sequence that can be used to identify a longer protein. In a technique developed by Matthias Mann and Matthias Wilm at EMBL in the 1990s, mass spectra of protein fragments derived from MS/MS analysis of proteomics experiments are searched for the presence of sequence tags, and these are used to identify the original protein.


Pericentromere

The regions of eukaryotic chromosomes that immediately flank the centromere. Like centromeres, pericentromeres contain a large proportion of repetitive sequences and few genes. The pericentromere is a structural domain of the chromosome that is essential for chromosomal segregation; it has been implicated in the cohesion of the chromosome pairs.


Perl

A programming or scripting language that is particularly useful for interpreting and reformatting large quantities of textual data. It is available for all common computer platforms; it is regarded as being easy to learn and use and is the programming language of choice of most bioinformatics professionals who were not trained as programmers. Libraries of Perl scripts for bioinformatics tasks (e.g., BioPerl) have been made available.


Phage

Trivially, any virus that infects bacteria. Phages are simple viruses, consisting of a core of either DNA or RNA surrounded by a protein coat. Some phages are virulent; infection with a virulent phage inevitably leads to viral replication, and the death and lysis of the host cell. Other phages, known as temperate phages, may insert their DNA into the host chromosome where it remains transcriptionally silent. Phages have many uses in modern molecular biology. Source: Kahl, G, The Dictionary of Gene Technology.

Phage Display

A technique for the presentation of distinct proteins or peptides on the surface of bacteriophage particles. Genes for the proteins to be displayed are integrated into the phage genome, and the proteins are expressed as fusions with a viral coat protein. This exposes the displayed proteins on the phage surface. The technique enables the identification of proteins with particular binding properties.


The influence of genetics and, especially, of genetic variation, on pharmacology: that is, how differences in people’s genetic makeup influence their response to drugs. One important aspect of this is the effect of common polymorphisms in the P450 protein family on drug metabolism, and hence on the optimum dose of each drug for different individuals.


Phenotype

The observable characteristics of an organism. These may be structural, functional, or (with higher animals, particularly man) behavioral, and they derive from both the organism’s genetics and its environment.


Phosphorylation

Generally, the addition of a phosphate group (transferred as a phosphoryl group, PO₃²⁻) to any (usually organic) molecule. In proteomics, the addition of a phosphate group to the hydroxyl group of a serine, threonine or tyrosine residue of a protein. Protein phosphorylation is catalyzed by the large kinase family of enzymes and is extremely important in cellular signaling pathways.

Phylogenetic Footprinting

A bioinformatics technique for identifying regulatory elements in DNA by locating regions of orthologous noncoding DNA that show unexpectedly high conservation between species.

Phylogenetic Marker

A gene, coding for either RNA or protein, that can be used for phylogenetic analysis because changes in its sequence can be consistently followed throughout a relevant period of evolutionary history. Genes that are highly conserved throughout long evolutionary distances, such as RNA genes and certain essential proteins such as vacuolar ATPases (ubiquitous in eukaryotes) and cytochromes, are commonly used as phylogenetic markers.

Phylogenetic Profile

A binary string that describes the presence or absence of a particular gene in all fully sequenced genomes – thus, if a gene is present in a species’ genome a “1” will be entered in that position in the string, whereas if it is not, a “0” will be entered. As proteins that take part in, for instance, the same metabolic pathway or process are likely to evolve in a correlated fashion, proteins with similar phylogenetic profiles are thought likely to be functionally related.
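One simple way to compare two such profiles is to count the positions at which they differ (the Hamming distance). This is only a sketch: the gene names and profiles below are invented, and real analyses may use other similarity measures.

```python
def hamming_distance(profile_a, profile_b):
    """Number of genomes in which exactly one of the two genes is present."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same set of genomes")
    return sum(a != b for a, b in zip(profile_a, profile_b))


# Hypothetical profiles over seven genomes ("1" = gene present, "0" = absent).
profiles = {
    "geneA": "1101001",
    "geneB": "1101011",
    "geneC": "0010110",
}

print(hamming_distance(profiles["geneA"], profiles["geneB"]))  # small: similar profiles
print(hamming_distance(profiles["geneA"], profiles["geneC"]))  # large: dissimilar profiles
```

Under the reasoning given above, geneA and geneB would be better candidates for a functional relationship than geneA and geneC.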

Phylogenetic Tree

A tree diagram showing the evolutionary relationships between species, or between genes or proteins in a family, that are believed to have a common ancestor. The edge lengths of the “branches” correspond to estimates of the distance between the entities in evolutionary time. In a rooted tree, there is a unique node at the bottom of the tree that represents the (putative) most recent common ancestor of the entities at the “leaves”.


Phylogenomics

The use of molecular evolution (phylogeny) to help deduce the function of proteins. These techniques rely on the fact that genes that have diverged in speciation events (orthologs, e.g., human hemoglobin and mouse hemoglobin) are generally closer in function than genes that have diverged in duplication events (paralogs, e.g., human hemoglobin and human myoglobin). Phylogenetic analysis is used to identify orthologs of genes of unknown function, and information from those orthologs is then used in the annotation of the new genome.


Plasmid

A piece of closed, circular, autonomously replicated, double-stranded DNA. Plasmids range in size between 1 and >200 kb. They are found mainly in bacterial cells, with copy numbers from one to several hundred per cell. They are one of the main means of horizontal gene transfer (and so the transfer of traits such as antibiotic resistance) between prokaryotes; modified plasmids are used in the construction of cloning vectors. In eukaryotes, plasmids may be found in mitochondria and plastids.


Pleiotropy

A gene is defined as pleiotropic if mutations in that gene have different clinical effects. For example, mutations in the gene for fibrillin-1, located on human chromosome 15, cause Marfan syndrome, but that syndrome may have strikingly different clinical effects, involving one or more of the skeletal, ocular, and cardiovascular systems. The fibrillin-1 gene is therefore described as strongly pleiotropic. The disease or condition concerned (in this case Marfan syndrome) may also be described as pleiotropic.

Poisson Distribution

A probability distribution that describes the number of successes expected when a large number of trials is conducted but the probability of success in each individual trial is small. In bioinformatics, it can be applied, for example, to estimating the probability that two sequences chosen at random will have a similarity score as high as one that would be expected for sequences that have a common ancestor.
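The distribution has a single parameter, the expected number of successes (the number of trials multiplied by the per-trial probability). The sketch below computes its probability mass function; the figures of one million trials and a per-trial probability of two in a million are purely illustrative.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k successes when lam successes are expected."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def prob_at_least_one(lam):
    """Probability of one or more successes: 1 - P(0 successes)."""
    return 1.0 - math.exp(-lam)


# Illustrative numbers only: many trials, each with a tiny chance of success.
trials = 1_000_000
p_success = 2e-6
lam = trials * p_success  # expected number of successes

for k in range(4):
    print(k, poisson_pmf(k, lam))
print("at least one:", prob_at_least_one(lam))
```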


The phenomenon in which a nonsense mutation introduced into a gene transcribed early in an operon has the secondary effect of repressing expression of nonmutated genes downstream of the mutated gene. The mutation involved is termed a polar mutation or dual effect mutation.

Poly-A Tail

A sequence of 60-200 adenine nucleotides added to the 3′ end of most eukaryotic mRNAs after transcription by a template-independent poly(A) polymerase. Its role is to add stability to the mRNA. Source: Kahl, G, A Dictionary of Gene Technology (Wiley-VCH, 2001).


Polycistronic

An mRNA is said to be polycistronic if it contains the transcript of more than one gene, expressed under the control of a single set of transcription factors. Generally, a polycistronic mRNA will contain the transcripts expressed from a single operon (an example being the lac operon in E. coli). Most often, the proteins are synthesized separately, but sometimes the entire message will be translated into a polyprotein.

Polymerase Chain Reaction

A technique used for the selective amplification of a region of target DNA between two annealed primers, by the DNA polymerase-driven extension of those primers in the 5′ to 3′ direction. It initially uses target DNA as the sequence template. The target DNA is first heated with an excess of primers, nucleotides, and DNA polymerase to over 93°C to separate the strands. It is then cooled, and the primers anneal to the original DNA. When the temperature is raised again, the polymerase catalyzes the extension of the primer strands. This produces two new duplexes and the cycle then repeats with the system being heated again to break the hydrogen bonds in the new duplexes.
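Because each cycle produces two duplexes from one, the number of copies in an ideal reaction doubles per cycle. The sketch below expresses this; the `efficiency` parameter (1.0 meaning perfect doubling) is an assumption added for illustration and is not part of the definition above.

```python
def copies_after(cycles, initial_copies=1, efficiency=1.0):
    """Copies of the target region after a number of PCR cycles.

    efficiency = 1.0 models ideal doubling in every cycle; lower values
    are a hypothetical way to model cycles in which not every template
    is extended.
    """
    return initial_copies * (1.0 + efficiency) ** cycles


print(copies_after(30))                   # ideal doubling: 2**30 copies
print(copies_after(30, efficiency=0.9))   # fewer copies if doubling is imperfect
```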


Polymorphism

Any specific change in a DNA sequence that is found in some individuals, leading to heterogeneity in a population. This change in genotype may or may not lead to a change in phenotype. Polymorphisms may be in coding or noncoding DNA and may consist of deletions, insertions, inversions, genetic rearrangements, or single base changes. The last named are known as Single Nucleotide Polymorphisms or SNPs.


Polyphyletic

In phylogeny, a taxonomic group is defined to be polyphyletic if it is not monophyletic or paraphyletic. A monophyletic group (or clade) consists of a single ancestor plus all its descendants; a paraphyletic group is a monophyletic group minus one or more distinct subclades. All other groupings (polyphyletic groups) are considered to be unnatural assemblages and are not used in phylogeny, even if there is a phenotype common to the organisms. An example of a polyphyletic group is the group of warm-blooded animals (mammals + birds).


Polyplex

A complex formed between cationic polymers and DNA, used in nonviral vectors for gene therapy. Complexes formed by cationic lipids for the same reason are termed lipoplexes. The DNA in both these types of complexes is protected from degradation by nucleases. The linear polymer poly-L-lysine was the first cationic polymer to be used in this type of gene delivery, in 1998.


Polytopic Membrane Proteins

Membrane proteins that are embedded in a cell or organelle membrane and that cross the membrane more than once, in contrast to single-pass membrane proteins. The term is usually reserved for the alpha-helical type of membrane proteins, which are not found in the outer membranes of Gram-negative bacteria.

Population Isolate

A population, generally of humans but potentially of other species, that has been genetically isolated by geography and lack of outbreeding and that will therefore exhibit less genetic heterogeneity and a higher degree of linkage disequilibrium. Population isolates are very useful for the study of genetic diseases, as mutations may accumulate, leading to unusually high prevalence of certain genetic diseases.

Population Structure

In population genetics, a population has a structure if its distribution of genetic material is nonrandom. If population structure is undetected, genetic association studies can give both false-positive and false-negative results. Studies of common, multigenic disorders are particularly prone to this problem.

Position-specific Scoring Matrix

A matrix of numbers representing the likelihood of finding a particular base or amino acid at each position of a domain or motif. Each row of the matrix represents a base or amino acid type and each column represents a position in the motif or domain sequence, from the first to the last. The values in the matrix give the log odds of finding each residue at each position. These matrices are used to select regions that are similar to the sequence family modelled. In this method, gaps are not allowed in the motifs modeled.
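A minimal sketch of how such a matrix might be built and used follows. It assumes a uniform background frequency and a pseudocount of 1 (both are simplifying assumptions), and the aligned sites are invented examples rather than real data.

```python
import math


def build_pssm(aligned, alphabet="ACGT", pseudocount=1.0):
    """Log-odds matrix: one row per residue type, one column per position."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("ungapped sequences of equal length are required")
    background = 1.0 / len(alphabet)  # assumption: uniform background
    pssm = {residue: [] for residue in alphabet}
    for pos in range(length):
        column = [seq[pos] for seq in aligned]
        total = len(column) + pseudocount * len(alphabet)
        for residue in alphabet:
            freq = (column.count(residue) + pseudocount) / total
            pssm[residue].append(math.log2(freq / background))
    return pssm


def score(pssm, window):
    """Sum of the log-odds values for the residues in one window."""
    return sum(pssm[residue][i] for i, residue in enumerate(window))


def best_match(pssm, sequence):
    """(score, start) of the highest-scoring ungapped window."""
    width = len(next(iter(pssm.values())))
    return max(
        (score(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


# Hypothetical aligned sites (no gaps, as the method requires).
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
pssm = build_pssm(sites)
print(best_match(pssm, "GGCGTATAATGCCG"))
```

The window scan reflects the last point in the definition: every window has the same width as the motif, so gaps cannot be represented.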

Positional Cloning

The cloning of a specific gene in the absence of a transcript or a protein product, using genetic markers tightly linked to the target gene and a direct or random chromosome walk by linking overlapping clones from a genomic library.

Positive Inside Rule

A rule that states that the segments or loops of a polytopic membrane protein that lie inside the cell (i.e., in the cytoplasm) contain more positively charged residues than those that lie outside the cell, in the periplasm or the extracellular medium. It is often used with hydropathy analysis to predict the number, location and topology of helices in these proteins. TopPred and TMHMM are examples of publicly available algorithms that use this rule in their prediction of transmembrane helix topology.

Posterior Probability

In Bayesian probability theory, the conditional probability of an event when empirical data has been taken into account. It may be calculated from the prior probability and the likelihood using Bayes’ Theorem.
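For a hypothesis that is either true or false, Bayes' theorem can be written out directly. The sketch below does this; the function name and the numbers are illustrative assumptions.

```python
def posterior(prior, likelihood, likelihood_if_false):
    """P(hypothesis | data) for a two-outcome hypothesis, via Bayes' theorem.

    prior               = P(hypothesis)
    likelihood          = P(data | hypothesis)
    likelihood_if_false = P(data | not hypothesis)
    """
    evidence = likelihood * prior + likelihood_if_false * (1.0 - prior)
    return likelihood * prior / evidence


# Illustrative numbers: a rare hypothesis and fairly discriminating data.
print(posterior(prior=0.01, likelihood=0.9, likelihood_if_false=0.05))
```

Even with data that are eighteen times more likely under the hypothesis, the posterior stays well below one half here because the prior is small.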

Posttranslational Modification

Any chemical modification to a protein that is made once the protein has been translated from its mRNA. There are thought to be several hundred different posttranslational modifications, ranging from cross-linking with disulphide bonds and simple glycosylation and phosphorylation to the covalent binding of complex cofactors. The large number of posttranslational modifications in higher eukaryotes is one reason why their proteomes are much larger than their genomes.

Preinitiation Complex

Trivially, the protein-DNA complex that is assembled prior to the transcription of a gene. In practice, the assembly of all components of the basal transcriptional machinery – that is, the complex of universal nuclear proteins, comprising RNA polymerase II(B) and transcription factors – on the core promoter. The assembly of the preinitiation complex initiates transcription. Source: Kahl, G, The Dictionary of Gene Technology.


Pre-mRNA

Any complete primary transcript of a structural (protein-coding) gene before it is modified to form the mature transcript, which is, in turn, translated into protein. In eukaryotes, pre-mRNA includes transcripts of the exons as well as the introns. Spliceosomes – large complexes made up of protein and RNA – excise the introns; in separate processing steps, a cap is added to the 5′ end of the RNA and a poly-A tail to the 3′ end.

Premutation Allele

An allele of a gene for one of the so-called triplet expansion diseases (e.g., Huntington’s disease, Fragile X syndrome) that is toward the high end of the phenotypically normal range. Individuals carrying premutation alleles are at greatly increased risk of passing on a defective allele to their offspring as a result of further expansion. For example, individuals with normal alleles for Fragile X syndrome carry between 6 and 50 CGG repeats in that gene, individuals with premutation alleles between 50 and 200 repeats, and affected individuals often well over 200.


Prevalence

The number, or percentage, of cases (generally but not necessarily of disease) present in a population at a given time. This is to be compared with the incidence of the disease, which is the rate of occurrence of new cases of the disease during a given period. A chronic and relatively benign disease such as asthma or arthritis will have a much greater prevalence than incidence.

Primary Structure

The amino acid sequence of a protein. In practice, the term “primary structure” is only used as a synonym for sequence in structural proteomics, where it is viewed as the first grouping in the hierarchy of protein structure classification, coming before secondary structure (alpha helices and beta strands), tertiary structure (the fold of a single polypeptide chain) and quaternary structure (the arrangement of chains to form a functional protein).


Primer

A short, generally synthetic oligonucleotide that is complementary to part of a larger DNA molecule. Primers form the 3′ end of substrates onto which DNA polymerases can add nucleotides to grow a new DNA chain. Primers define the region that is amplified in the polymerase chain reaction, and so must be chosen carefully if only the correct sequence of DNA is to be amplified.


Profile

A way of representing a multiple sequence alignment numerically as a matrix of scores, where each score represents the probability of finding a particular amino acid (or base) at a particular position in the profile. Profiles are often used for classifying protein domains into functional families; they can be used to model DNA sequence alignments, but this is much less common. Some databases, such as Pfam and SMART, use profiles generated using hidden Markov models.


Promoter

A region of DNA located upstream of the initiation site, to which RNA polymerase binds to initiate transcription. Prokaryotic promoters, and eukaryotic promoters that bind different types of RNA polymerase, have very different sequences. Promoters of a particular type are also quite divergent in sequence but are characterized by specific short sequence patterns: for example, prokaryotic promoters have the so-called “Pribnow box” sequence at approximately position -10 and eukaryotic promoters that bind RNA polymerase II have the “TATA box” sequence at approximately position -25.


Prophase

In the cell cycle, the first phase of cell division, during which the chromosomes (whose DNA was replicated earlier, in S phase) condense. By the end of prophase, the chromosomes are visible under the light microscope, with each pair of sister chromatids held together at the centromere. Details of the chromosomes, including abnormalities, can be viewed easily during prophase.


Protease

An enzyme that breaks peptide bonds by hydrolysis, thus breaking a protein into peptides. Most proteases are specific, that is, they only break bonds before and/or after particular patterns of amino acids. They have been divided into four main families based on the functional groups in their active sites: the aspartic (or acid) proteases, cysteine proteases, serine proteases, and zinc (or metallo-) proteases. In proteomics, proteases are used to break separated proteins into peptides prior to identification.

Protein Interaction Map

A map showing the complex network of interactions between (preferably) a large subset of the proteins expressed in a given cell type at a given time. Protein interaction maps may be generated using two-hybrid technology.

Protein Microarray

An array of probes that is used, by analogy with cDNA microarrays, to determine which proteins are present in a sample. A signal is detected whenever a protein binds to a probe (which may be, for example, an antibody). Protein microarrays are much less advanced, technologically, than cDNA microarrays, but they are now becoming more available.

Protein Profiling

Any technology that is used to quantify the expression level of every protein in a tissue sample may be described as a protein profiling technology. It is, essentially, the equivalent in proteomics of the DNA microarray in transcriptomics. The technologies involved are still very much in development but some of the most promising developments involve arrays of spotted antibodies or spotted protein antigens.

Protein Trafficking

The processes by which proteins synthesized in a cell move through the cell to their eventual destinations – within the cytoplasm or an organelle, embedded in a cell or organelle membrane, or secreted from the cell – are generically known as protein trafficking. The endoplasmic reticulum and Golgi bodies are involved in trafficking. Some parts of protein sequences, such as signal peptides, may determine protein location.


Proteoglycan

A particular type of glycoprotein (or protein-saccharide conjugate) that is heavily glycosylated, that is, has a high proportion by mass of saccharide. Proteoglycans always consist of a core peptide chain with one or more linear chains of glycosaminoglycans that have sulphate and/or uronic acid groups attached and so are negatively charged. Proteoglycans can have a variety of forms and functions.


Proteolysis

The breakdown of a protein into peptides by a protease. There are hundreds of different proteases known, each with a different specificity. In complete proteolysis the protein is broken down into its constituent amino acids. Proteolysis is a natural function that occurs in all organisms, even viruses, but it is also an important part of many scientific analyses. In core proteomics methodologies, separated proteins are broken down into short peptides by proteolysis before mass analysis. Trypsin is one enzyme that is very commonly used for this.

Proteolytic Peptide

Peptides that are produced from the digestion of a protein by a protease are termed proteolytic peptides. The digestion of proteins into fragments, using proteases – most often trypsin (hence tryptic peptide) – is the first step in the procedure of protein identification using mass spectrometry. Trypsin cleaves preferentially after the positively charged amino acids lysine and arginine.


Provirus

Any viral DNA that becomes an integral part of the host cell chromosome and is therefore transmitted from one cell generation to another without lysis of the host cell. A retrovirus that has been integrated into a host chromosome is an example of a provirus. Similarly, a prophage is bacteriophage DNA that becomes integrated into the chromosomal DNA of a bacterial host. Source: Kahl, G, The Dictionary of Gene Technology.


Pseudogene

A nonfunctional derivative of a functional gene that has been inherited by an organism but is no longer needed. During evolutionary history, the sequences of pseudogenes mutate to prevent normal gene expression, for example, by changing promoter region sequences or inserting stop codons. Pseudogenes often retain significant similarity to functional genes in other species and so may be found by homology-based gene finding programs.


Pseudoknot

A complex structural interaction between local regions of an mRNA molecule in which one strand of an RNA hairpin is folded back on itself to form, first, a second loop and then a series of base pairs with bases in the loop region of the first hairpin.


PSI-BLAST

A form of BLAST in which the database is searched with a profile of sequences rather than a single sequence. The first cycle of PSI-BLAST is a traditional BLAST run; then a profile is constructed from the sequences that match the first sequence and a second cycle of BLAST is run using that profile. The process is then repeated until no further sequences are added in a run, at which point the run is said to have converged. PSI-BLAST can only be run with protein sequences. It is a sensitive method of finding distant homologs.

Quadrupole Ion Trap

A type of mass spectrometer that has been developed relatively recently and which is both sensitive and versatile. Ions are focused into the ion trap machine using electrodes, and the ions are injected into the trap using an electrostatic ion gate, which pulses open and closed. The ion trap is filled with helium. The kinetic energy of the ions is reduced by collisions with helium atoms, and the ions are trapped with their movement depending on their mass and charge. This allows the precise determination of mass/charge ratios.

Quantitative Trait Locus

A genomic region that contains two or more genes or two or more separate genetic loci (map positions) that are known to contribute cooperatively to the establishment of a specific phenotype or trait. Source: Kahl, G, The Dictionary of Gene Technology.

Quaternary Structure

The highest level in the hierarchy of protein structure. Quaternary structure describes the association of more than one separate protein chain into an active protein complex, held together by disulfide bonds or noncovalent interactions. The association of the four globin chains that make up hemoglobin is a simple example.

Radiation Hybrid Map

A dense map of a mammalian chromosome that is created with a somatic cell hybrid technique. Pairs of genes are localized by using two gene-specific primers to amplify a single PCR product, which is then given a radioactive label and hybridized to a panel of radiation hybrid clones. Hybridization indicates that sequences complementary to the PCR product are present in a radiation hybrid clone.

Ramachandran Plot

A plot, against each other, of the two torsion angles that together describe the backbone conformation of an amino acid residue within a protein. Generally phi, the torsion angle about the N-CA bond, is plotted on the x axis and psi, the torsion angle about the CA-C bond, on the y axis. Each residue is therefore represented by a single point. The positions that may be adopted in real proteins are limited by steric hindrance to regions corresponding to alpha helices and beta strands, and a smaller, less populated region corresponding to the left-handed alpha helix.
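A torsion angle is defined by four consecutive atoms: by the usual convention, phi uses C (of the previous residue), N, CA and C, and psi uses N, CA, C and N (of the next residue). The sketch below computes such an angle from Cartesian coordinates. The helper names are hypothetical and the coordinates are toy points, not taken from a real structure.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def torsion_deg(p0, p1, p2, p3):
    """Signed torsion angle (degrees) about the p1-p2 bond."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)  # normals of the two planes
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Toy coordinates: the outer points are a quarter turn apart about the z axis.
print(torsion_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Computing phi and psi in this way for every residue, and plotting one against the other, gives the points of the plot.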

Read Length

The length (in bases) of the small pieces of DNA that are sequenced using Sanger’s sequencing techniques and then assembled into longer pieces (and eventually into chromosome and genome sequences) is termed the read length. Computer programs are used to assemble the resulting short sequences. A typical read length in many sequencing projects is 500 bases.

Reading Frame

The position from which the codons defining amino acids are read when a DNA sequence is translated into protein. As codons contain three bases, each strand may be read in three ways, so, for instance, a sequence beginning ACGT … may be read starting from the A, the C and the G; if the sequence starts from the T, it is read in the first reading frame with the first codon (amino acid) missing. Any gene sequence may be read in six reading frames, three on the forward strand and three on the reverse strand.
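The six frames can be enumerated mechanically, as in the sketch below. The function names and the frame labels ("+1" to "+3" for the forward strand, "-1" to "-3" for the reverse strand) are labelling choices made for this example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def codons(dna, offset):
    """Complete codons read from the given starting offset (0, 1 or 2)."""
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]


def six_frames(dna):
    """Codons in all six reading frames, keyed by a frame label."""
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = codons(dna, offset)
        frames[f"-{offset + 1}"] = codons(reverse, offset)
    return frames


for label, frame in sorted(six_frames("ACGTACGTA").items()):
    print(label, frame)
```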

Real Time PCR

A method of monitoring the amount of product (amplicon) generated during the polymerase chain reaction (PCR) using a fluorescent reporter. The amount of fluorescence, which can be measured directly in real time, is directly dependent on the amount of reporter and thus on the amount of amplicon present. It can thus be used throughout the PCR reaction, including the exponential phase, and not just at the end. Real time PCR is sensitive, specific, and reproducible over a wide range of concentrations.


Receptor

A protein that recognizes another molecule (known as its ligand) and becomes activated when that ligand binds. This activity may take one of a number of forms, including, for instance, conformational changes and binding further molecules. The genomes of free-living organisms contain genes for many hundreds, if not thousands, of different receptors. They are probably the most important large family of protein drug targets. Drugs targeted at receptors may duplicate the receptor activity (agonists) or block the receptor site preventing activity (antagonists).


Recessive

A recessive trait or phenotype is one that is not expressed unless an individual carries two alleles for that trait. Many genetic conditions, including the relatively common cystic fibrosis, are inherited as recessive traits. The existence of dominant and recessive traits was one of the key discoveries of Gregor Mendel, which led to the invention of genetic mapping. It is now known to be an oversimplification.

Recombination Hot Spot

The rate of recombination is the rate at which genes are combined in a “child” cell or organism in a different pattern from that in which they are found in either of the parents, for example, due to exchange of DNA between chromosomes. The rate of recombination is not constant throughout a genome, or even an individual chromosome: areas of the genome where recombination is particularly common are known as recombination hot spots. (Similarly, regions where recombination rates are low are known as cold spots.)


Generally, a code is said to be redundant if part of it is unnecessary because more than one entity in the code maps on to the same entity in the translation. In molecular biology and bioinformatics, the genetic code is said to be redundant because 64 different codons code for 20 amino acids plus the stop signal. The number of codons that code for individual amino acids ranges between one and six.
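The figures quoted above can be checked by counting the codons assigned to each amino acid in the standard genetic code. The sketch below encodes that table compactly, with bases in the order T, C, A, G and "*" marking a stop codon; the variable names are hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in the order TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of codons that map to each amino acid (or to the stop signal).
degeneracy = Counter(CODON_TABLE.values())

for amino_acid, count in sorted(degeneracy.items(), key=lambda item: -item[1]):
    print(amino_acid, count)
```

Leucine, serine and arginine each have six codons, whereas methionine and tryptophan have one each.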


Reflectron

A mass analyzer, used in mass spectrometry and thence in proteomics, that focuses a beam of ions by reversing the direction of the ions using a retarding electric field. The result is a reduction in the spread of kinetic energies in the ion beam. Types of reflectrons available include single-stage (the simplest), dual-stage, quadratic, and curved-field.

Regular Expression

An expression of characters, normally alphanumerics and symbols, that can be matched automatically using pattern matching software. Regular expression matching is commonly used in bioinformatics algorithms, in, for example, searching for amino acid or base patterns. The Unix tools sed, awk, and grep use regular expression matching and the programming language Perl has been optimized for this programming task.
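As an example of searching for an amino acid pattern, the sketch below uses Python's `re` module to look for the N-glycosylation sequon, written in PROSITE notation as N-{P}-[ST]-{P} (asparagine, any residue except proline, serine or threonine, any residue except proline). The protein sequence is invented, and the lookahead is used only so that overlapping matches are reported.

```python
import re

# N, then anything but P, then S or T, then anything but P.
# Wrapping the pattern in a lookahead lets overlapping matches be found.
SEQUON = re.compile(r"(?=(N[^P][ST][^P]))")


def find_sequons(protein):
    """(start index, matched text) for every occurrence, including overlaps."""
    return [(m.start(), m.group(1)) for m in SEQUON.finditer(protein)]


# Hypothetical sequence for illustration.
print(find_sequons("MKNGSALNPTQNNTSY"))
```

The candidate beginning "NPT" is rejected because proline is excluded at the second position.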


A gene coding for a protein, known as a repressor, that blocks the activity (DNA binding) of an operator and so prevents transcription of the adjacent operon. In these cases, transcription is often induced by effectors that bind to the repressor proteins, causing changes to the repressor structure and preventing its binding.

Regulatory Network

A network of interactions between genes in which the condition represented by the edges is regulation; that is, two genes (nodes) are joined by an edge if the expression of one regulates the expression of the other. It is self-evident that all regulatory networks are directional.

Relational Database

Any database that is built using the relational model is termed a relational database; the best-known commercial example is probably Oracle. A relational model is a logical data structure defined using set theory, so each data item is a member of one or (usually) many more than one set. It can be stated more simply by saying that the data is collected in tables that are linked using keys, so relationships may be modeled across tables.
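The idea of tables linked by keys can be shown with a small example. The sketch below uses SQLite (through Python's built-in `sqlite3` module) rather than Oracle, purely because it needs no server; the table names, column names and rows are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript(
    """
    CREATE TABLE gene (
        gene_id INTEGER PRIMARY KEY,
        symbol  TEXT NOT NULL
    );
    CREATE TABLE variant (
        variant_id  INTEGER PRIMARY KEY,
        gene_id     INTEGER NOT NULL REFERENCES gene(gene_id),  -- key linking the tables
        description TEXT NOT NULL
    );
    """
)
conn.executemany("INSERT INTO gene VALUES (?, ?)", [(1, "FBN1"), (2, "BRCA1")])
conn.executemany(
    "INSERT INTO variant VALUES (?, ?, ?)",
    [
        (10, 1, "example variant A"),
        (11, 1, "example variant B"),
        (12, 2, "example variant C"),
    ],
)

# A relationship modeled across tables: count the variants recorded per gene.
rows = conn.execute(
    """
    SELECT g.symbol, COUNT(v.variant_id)
    FROM gene AS g
    JOIN variant AS v ON v.gene_id = g.gene_id
    GROUP BY g.symbol
    ORDER BY g.symbol
    """
).fetchall()
print(rows)
```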

Replication Competent

Simply, a piece of DNA that is able to replicate is replication competent (the opposite being replication deficient). The term is often used in virology and gene therapy; a virus vector for gene therapy that is replication competent will be able to multiply and distribute the introduced gene round the body.

Replication Fork

Trivially, a region of genome sequence that starts with an initiation (START) codon and ends with a termination (STOP) codon, and so is translated into protein. A scan of a genome sequence for long ORFs is the first and easiest stage of gene prediction. In practice, the situation is much easier in prokaryotic genomes than in eukaryotic genomes, which are complicated by the extreme length of some genes, the presence of introns, and the necessity of identifying splice sites.


Replicon

A segment of DNA under the control of a single replication-initiation locus and behaving as an autonomous unit during DNA replication. Whole plasmids and bacterial chromosomes are replicons. In eukaryotes, the number of replicons tends to increase with increasing genome size and organism complexity (e.g., yeast: 500 replicons, average size 40 kb; mouse: 25 000 replicons, average size 150 kb). Source: Kahl, G, The Dictionary of Gene Technology (2nd edition).

Reporter Gene

A gene whose product is easy to detect or measure (for example, beta-galactosidase, luciferase, or green fluorescent protein) and which is therefore placed under the control of a regulatory sequence of interest, so that the activity of that sequence can be monitored through the expression of the reporter.


Repressor

A protein that binds specifically to the regulatory sequence of an operator gene, blocking the movement of the RNA polymerase along the operator DNA and therefore blocking the initiation of transcription. The affinity of repressor proteins can be modulated by small molecules that are known as effectors. Many repressors use the helix-turn-helix motif for binding the operator DNA.

Restriction Enzyme

An enzyme that recognizes specific short target sequences in double-stranded DNA and catalyzes the formation of double-strand breaks.
Restriction enzymes are natural enzymes that protect cells from foreign DNA, and are frequently used in molecular biology. Many different restriction enzymes are known, each recognizing a different short oligonucleotide sequence, and they are used to detect small differences between DNA sequences.

Restriction Fragment Length Polymorphism

A polymorphism in which different alleles have different sequences at one or more restriction enzyme cut sites, so that in at least one case, a cut site is added or removed. Cutting the gene sequence using the restriction enzyme concerned will therefore produce fragments of different lengths that can be easily identified. This is a simple and cost-effective way of detecting polymorphisms.
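
The effect of a lost cut site on fragment lengths can be shown with a short sketch. It assumes the enzyme EcoRI, which recognizes GAATTC and cuts after the first G, and applies it to two made-up linear alleles that differ by a single base inside the second recognition site. The function name digest and both allele sequences are illustrative only.

```python
def digest(seq, site="GAATTC", cut_offset=1):
    """Return fragment lengths after cutting a linear sequence at every site.

    cut_offset is the position of the cut within the recognition site
    (1 for EcoRI, which cuts between G and AATTC).
    """
    seq = seq.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


# Two hypothetical alleles; allele_2 has lost the second cut site
# through a single base change (GAATTC -> GAGTTC).
allele_1 = "AAAA" + "GAATTC" + "TTTTTTTT" + "GAATTC" + "CCCC"
allele_2 = "AAAA" + "GAATTC" + "TTTTTTTT" + "GAGTTC" + "CCCC"
```

digest(allele_1) gives three fragments and digest(allele_2) gives two, so the alleles would be distinguishable by fragment length alone.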


Retrovirus

A member of a class of viruses that infect eukaryotic cells and that has single-stranded RNA as its genetic material. After the virus infects a cell, its RNA is reverse-transcribed into a DNA copy by the enzyme reverse transcriptase, and this copy is integrated into the eukaryotic DNA; the integrated (endogenous) retrovirus is termed a provirus. When the provirus is transcribed, viral proteins that can associate into new virus particles are formed. Human immunodeficiency virus (HIV), which causes AIDS, is the best-known retrovirus.

Reverse Transcription

The process, catalyzed by the enzyme reverse transcriptase, by which a double-stranded DNA molecule is synthesized using a single-stranded RNA molecule as a template, starting from a primer. Reverse transcription is used in recombinant DNA technology for the synthesis of cDNA from messenger RNA. It is also the process by which a retrovirus copies its RNA genome into DNA before that DNA is integrated into the eukaryotic genome.
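
At the level of sequence alone, the relationship between a messenger RNA and the first cDNA strand copied from it is a reverse complement with T in place of U. The sketch below models only that sequence relationship, not the enzymology or priming; the function name first_strand_cdna is an illustrative choice.

```python
# Base-pairing rules for copying RNA into DNA.
COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna):
    """Return the first-strand cDNA (written 5'->3') for an mRNA (5'->3')."""
    return "".join(COMPLEMENT[base] for base in reversed(mrna.upper()))
```

For example, the short RNA AUGGCC gives the cDNA strand GGCCAT.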

Risk Factor

Any feature that is known to increase a person’s chance of developing a disease is termed a risk factor. Risk factors may be lifestyle related (e.g., smoking, obesity, sun exposure) or genetic; well-known examples of the latter are the defective alleles of the BRCA1 and BRCA2 genes, which confer a greatly enhanced risk of developing breast or ovarian cancer.

RNA Interference

The silencing of a specific gene (i.e., the blocking of gene expression) by micro-injecting into cells single- or double-stranded RNA that is complementary to the gene to be silenced. It is used in functional genomics for determining the function of a gene. The injected RNA may in some circumstances be transmitted to germline cells and observed in the experimental organism’s progeny. RNA interference is also a natural mechanism for silencing gene expression.


Robust

A bioinformatics method or algorithm is described as robust if it is reliable and both sensitive and specific, that is, if it predicts features (for example) with few false positives and false negatives. Signals within sequences, defined as patterns or profiles, can also be defined as robust if they predict family members with great accuracy. Generally, increasing the number of sequences that contribute to a pattern or profile will increase its robustness.
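
Sensitivity and specificity are usually quantified from counts of true and false predictions: sensitivity is the fraction of real features that are found, and specificity is the fraction of non-features that are correctly rejected. The sketch below uses those standard definitions; the function names are illustrative, and any counts passed in would come from comparing predictions against a trusted reference set.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real features that the method detects: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of non-features correctly rejected: TN / (TN + FP)."""
    return true_neg / (true_neg + false_pos)
```

A method with few false negatives scores close to 1 on sensitivity, and one with few false positives scores close to 1 on specificity; a robust method in the sense above does well on both.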

Root Mean Square Deviation

A measure of the similarity of two structures, calculated as the square root of the mean of the squares of the (scalar) distances between corresponding selected points. In an analysis of protein structures, the points chosen are the atoms; generally, only main chain or alpha-carbon atoms are used. The root mean square deviation (RMSD) between a model protein structure and the experimentally determined structure of the same protein is a very useful measure of the quality of the prediction; a good model based on a close homolog may have an RMSD of less than 1 ångström (Å) from the experimental structure.
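
The calculation described above can be written in a few lines. This minimal sketch assumes that the two structures have already been superimposed and that the two coordinate lists hold the same atoms in the same order (for example, alpha-carbons only); it does not perform the superposition itself. The function name rmsd and the coordinate format are illustrative choices.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equal-length lists of (x, y, z).

    Assumes the structures are already superimposed and that atom i in
    coords_a corresponds to atom i in coords_b.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared distance between the paired atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

If every paired atom is displaced by exactly 1 Å, the function returns 1.0; identical structures give 0.0.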


Rotamer

The side chains of amino acids in protein structures are restricted by steric hindrance and therefore preferentially take up certain conformations, which are known as rotamers. Libraries of rotamer conformations are included in programs for three-dimensional protein structure determination and homology modeling, where they are used to suggest likely side chain positions.
