Exon-intron Structure To Locus Control Region (Genetics,Genomics,Proteomics,Bioinformatics)

Exon-intron Structure

The structure of a eukaryotic gene, in terms of the number, order and size of the coding exons and the noncoding introns that separate them. Eukaryotic genes vary widely in exon-intron structure, from single exon genes, through genes with a simple structure such as insulin with its single intron, to genes where over 90% of the genetic material consists of introns. It is even possible for one gene to be embedded in an intron of another gene.

Exon Skipping

The elimination of one or more exons from a transcript during splicing, such that the combination of exons remaining results in a different mRNA and hence in a translated protein with a different arrangement of domains. For example, a gene with three exons, A, B, and C may give rise to a protein containing domains coded from only exons A and C as a result of the skipping of exon B. Source: Kahl, G, The Dictionary of Gene Technology.

Expert System

A form of artificial intelligence in which the knowledge associated with an area of human expertise is codified into a set of rules that are then applied to the analysis and classification of data. Like artificial intelligence, more generally, expert systems are less common in bioinformatics than they were 20 or 30 years ago. They do, however, have some applications there, and they are more common in clinical applications.

Expressed Sequence Tag

A short, synthetic oligonucleotide of 300-500 bp, complementary to the 5′ or 3′ end of a specific mRNA and usually derived from a cDNA library by random sequencing. ESTs represent tags for the state of gene expression in a given cell type at a given time (disease status and/or developmental stage). Millions of sequenced ESTs have been deposited in public databases: ESTs represent the largest subdivision of the EMBL and GenBank databases. Source: Kahl, G, The Dictionary of Gene Technology (Wiley-VCH, 2001).

Expression Cassette

A DNA fragment (usually synthetic) into which foreign DNA can be cloned and expressed. The expression cassette is usually part of an expression vector and encodes a control region (e.g., promoter) with an adjacent Shine-Dalgarno sequence (for expression in prokaryotes), a signal peptide sequence if necessary, a polylinker, and a termination sequence. Source: Kahl, G, The Dictionary of Gene Technology.

Expression Profile

The simultaneously measured levels of many thousands of mRNAs (expressed genes) in a cell or tissue type, detected using a microarray. The classic microarray experiment measures differences between the expression profiles of different cell types, or the same type under different conditions.

External Spike-in Control

A standard pool of cDNA sequences showing no sequence identity to the human genome (or the other genome under analysis) used as controls in microarray experiments. The samples, which span a large concentration range, are added to the RNA samples before labeling and can be used to test for efficiency of cDNA synthesis/labeling, uniformity of hybridization, and sensitivity of detection.


FASTA

A bioinformatics tool for searching a gene or protein sequence database with a single sequence. FASTA is complementary to BLAST; it is slightly more precise, but runs much more slowly. The database is scanned for those sequences that contain the largest number of perfect matches to short subsequences of the test sequence, and each of these best matches is then aligned more precisely. FASTA has also given its name to the most commonly used sequence file format, where the sequence title is given on the first line preceded by a greater-than sign.
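
The FASTA file format described in this entry can be illustrated with a minimal parsing sketch. The function name `parse_fasta` and the two example records are hypothetical and serve only as an illustration; real files may need more careful handling (for example, duplicate titles).

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {title: sequence} mapping."""
    records: Dict[str, str] = {}
    title = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A greater-than sign opens a new record; the rest is the title.
            title = line[1:].strip()
            records[title] = ""
        elif title is not None:
            # Sequence data may be wrapped over several lines.
            records[title] += line
    return records


# Hypothetical example records (not taken from any database).
example = """>seq1 hypothetical peptide
AFDTGH
VGDTGN
>seq2 hypothetical DNA fragment
ATGGCC
"""
records = parse_fasta(example.splitlines())
```

Here `records` maps each title line to its sequence, with wrapped sequence lines joined into a single string.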


Fingerprint

A pattern of peptides obtained from a protein by proteolytic degradation (very often using the protease trypsin) and then separated and identified using mass spectrometry and optionally peptide sequencing. The peptide fingerprint produced by a particular protease is characteristic for each protein and can be used for protein identification.


FISH

Fluorescence in situ hybridization (FISH) is a technique used to study the details of an individual’s chromosomes, and particularly to determine chromosomal abnormalities either before or after birth. It involves labeling segments of single-stranded DNA (known as probes) with fluorescent dye. If the cells of the individual under study contain chromosomal DNA that is complementary to the probes, they will bind and the fluorescent signal will be detected. Unlike most techniques for chromosomal analysis, it does not need to be performed on dividing cells.

Fluorescence Resonance Energy Transfer

A technique used in many proteomics applications, for example, in probing the structure of proteins, and protein-protein and protein-ligand interactions. It uses the fact that energy can, under certain circumstances, be transferred between two dye molecules (donor and acceptor) in close proximity without the emission of a photon. The absorption spectrum of the acceptor must overlap the fluorescence emission spectrum of the donor and the dipoles of the two molecules must be approximately parallel. It can be used to determine intermolecular distances of the order of tens of Angstroms.
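
The way a distance is recovered from a measured transfer efficiency can be sketched with the standard Förster relation, E = 1 / (1 + (r/R0)^6). The entry itself does not state this relation; the function names and the Förster radius of 50 Angstroms used below are assumptions chosen for illustration.

```python
def fret_efficiency(r: float, r0: float) -> float:
    """Transfer efficiency for donor-acceptor distance r and Forster radius r0.

    Both distances must be in the same unit (e.g., Angstroms).
    """
    return 1.0 / (1.0 + (r / r0) ** 6)


def distance_from_efficiency(efficiency: float, r0: float) -> float:
    """Invert the Forster relation to estimate the distance (0 < efficiency < 1)."""
    return r0 * (1.0 / efficiency - 1.0) ** (1.0 / 6.0)


# Hypothetical dye pair with an assumed Forster radius of 50 Angstroms.
R0 = 50.0
half_point = fret_efficiency(50.0, R0)       # efficiency when r equals R0
estimated = distance_from_efficiency(0.5, R0)  # distance for 50% efficiency
```

Because of the sixth-power dependence, the efficiency changes steeply only for distances close to R0, which is why the technique is useful over a limited distance range.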


Fluorophore

Any molecule that produces fluorescence and that can therefore be used as a probe, typically in proteomics experiments. In proteomics, fluorescence may be described as intrinsic or extrinsic. Intrinsic fluorophores may be fluorescent molecules that are naturally bound to the protein, perhaps as cofactors, or amino acids with intrinsic fluorescence. Green fluorescent protein contains the unusual intrinsic fluorophore of an SYG (Ser-Tyr-Gly) tripeptide, which is posttranslationally modified to a 4-(p-hydroxybenzylidene)-imidazolidin-5-one. Extrinsic fluorophores are fluorescent molecules that may be artificially bonded to proteins.


Fold

The level in protein structure classification that groups together proteins with the same secondary structure elements connected in the same order. The Fold level in the SCOP database equates to the Topology level in CATH. Proteins with the same fold may have a common evolutionary origin, but homology cannot be assumed, particularly with the highly populated so-called superfolds (examples being the alpha-beta barrel and the immunoglobulin fold).

Fold Prediction

A method for predicting the tertiary (three-dimensional) structure of a protein that does not necessarily require the structure of a homologous protein to be available. It involves aligning the sequence of the protein to be modeled with known protein structures and using “threading” or a similar algorithm to select structures that are most compatible with the test sequence. It is a low precision method and cannot be used to predict novel folds, but it has had some notable successes, particularly in selecting remotely homologous structures as templates for homology modeling.

Force Field

A self-consistent representation of a molecule or molecular system (e.g., protein, ligand, and solvent molecules) using Newtonian mechanics, which can be used for energy minimization or molecular dynamics simulations. Atoms are represented by points with size, mass, and partial atomic charge, and bonds as “springs” separating them. Bond lengths, bond angles, and torsion (twist) angles are maintained close to optimum positions using energy terms, and other energy terms define the nonbonded forces between atoms. Given the crudity of this model, it can produce some surprisingly good results, but these techniques must be used with caution.

Founder Population

The small population that first invades an isolated area such as an island. Descendants of a founder population will exhibit reduced genetic variation as a result of this population bottleneck. Human communities that descended from founder populations may exhibit unusually high prevalence of Mendelian diseases.

Fourier Transform

A mathematical transform that expresses a function as a sum or integral of sinusoidal functions multiplied by constants known as amplitudes. There are many distinct forms of Fourier transforms, which were named for the French mathematician Jean-Baptiste Joseph Fourier. The most common applications of Fourier transforms in areas related to molecular biology are in structural biology, in deconvoluting X-ray and electron diffraction patterns to obtain the structures of macromolecules.


Fourier Transform Ion Cyclotron Resonance

A mass spectrometry method that is able to measure masses of, for example, peptide ions, with very high precision and accuracy. It is a versatile analysis method that may be used with both MALDI and ESI ionization technologies. It is an ion-trapping method, based on the principles of ion cyclotron resonance (ICR) spectrometry; one of its few disadvantages is that it is very sensitive to pressure and requires near vacuum conditions.


Frameshift

An alteration of the reading frame in which the sequence of a gene is read (i.e., translated into protein), caused by a change in sequence length (i.e., an insertion or deletion) of a number of nucleotides that is not divisible by 3. It usually results in the production of a truncated, nonfunctional protein because all the sequence downstream of the change will be translated incorrectly.
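
The effect of a frameshift can be sketched with a toy translation routine. The coding sequence below is hypothetical and was chosen so that deleting a single base creates a premature stop codon; the codon table is the standard genetic code, built compactly from the conventional TCAG ordering.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate from the first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical wild-type coding sequence: ATG GCC TTA GAC AAG TGA
wild_type = "ATGGCCTTAGACAAGTGA"
# Deleting one base (index 5) shifts the frame: ATG GCT TAG ...
mutant = wild_type[:5] + wild_type[6:]

wild_protein = translate(wild_type)
mutant_protein = translate(mutant)
```

In this example the wild-type sequence translates to five residues, whereas the shifted frame reaches a stop codon after only two.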

Free Induction Decay

In structure determination by NMR, free induction decay (FID) is a transient signal that decays (or relaxes) exponentially with time and that is caused by dephasing in an inhomogeneous magnetic field. The signal is sinusoidal and it is generated by spins in the x-y plane; it decays over time as magnetization returns to its equilibrium level.

Gain of Function Mutation

Trivially, any mutation where the protein produced by the mutated gene displays extra functionality (either a different function altogether or an enhancement in normal function) that is not present in the wild type. Gain of function mutations may be beneficial or deleterious; inheritance of these mutations is usually, if not always, dominant.

Gapped Alignment

Any alignment of two or more sequences that includes gaps, that is, that allows for insertions and deletions in the sequences. In practice, almost all alignment methodologies in modern bioinformatics produce gapped alignments; the only nongapped alignments are local alignments used where speed is at a premium. Early versions of BLAST produced nongapped alignments.

Gap Penalty

A penalty that is deducted from an alignment score for the addition of a gap in a sequence alignment. Alignment programs generally use two different gap scores: a large penalty for starting a gap (the gap insertion penalty) and a smaller penalty for extending one (the gap extension penalty). This reflects the fact that there is a greater difference between an ungapped alignment and one with an insertion or deletion of a single character (base or amino acid) than there is between an alignment with a single insertion or deletion and one with two.
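
The two-part scheme described above is often called an affine gap penalty. A minimal sketch follows; the penalty values (10 to open, 1 to extend) are assumptions for illustration, and conventions differ between programs as to whether the first gap position also incurs the extension penalty.

```python
def gap_cost(length: int, gap_open: float = 10.0, gap_extend: float = 1.0) -> float:
    """Affine penalty for one gap: pay gap_open once, then gap_extend per extra position.

    Convention assumed here: a gap of length 1 costs exactly gap_open.
    """
    if length <= 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


one_long_gap = gap_cost(2)                  # a single gap of two characters
two_short_gaps = gap_cost(1) + gap_cost(1)  # two separate single-character gaps
```

With these values a single two-character gap costs 11, whereas two separate one-character gaps cost 20, so the scoring favors fewer, longer gaps.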

Gene Duplication

An event during evolution in which a single gene is duplicated, giving rise to two different genes in the same genome. These genes will gradually diverge on an evolutionary timescale, giving rise to gene products with different sequences and usually different, although related, functions. Genes in the same genome that are related in this way are known as paralogs.

Gene Fusion

The use of recombinant DNA techniques to join (fuse) together two or more genes coding for different products so that they are expressed under the control of the same regulatory system.

Gene Gun

A piece of apparatus used for inserting transgenes into plant cells. Genes are loaded on to very small gold or tungsten pellets. These are then fired at the leaves or other tissues of the target plant, using the gene gun. The pellets pass through the plant tissue, but the genes are physically wiped off the pellets and may be incorporated into the plant chromosomes.

Gene Index

A list of genes; specifically, an annotated, nonredundant list of the genes in a genome, generally including other related genetic information and links. The TIGR Human Gene Index, which is freely available, contains data on the expression patterns, functions, and evolutionary relationships of the genes in the index. Gene indexes for other, less well studied genomes will tend to be less complete.

Gene Knockout

An informal term used in the lab and less frequently in non-specialist publications for the disruption of a gene by the addition or deletion of base sequences so that the function of the gene is abolished. Knockout mice, or mice in which the function of one gene has been removed, have very important uses in the study of genetic diseases.

Gene Silencing

The inactivation of a previously active (i.e., previously transcribed) gene. Its converse, gene activation, is used to mean the activation of a previously silent gene. Silencing (or activation) may take place by altering the transcription mechanism of the gene rather than the sequence of its coding regions.

Genetic Anticipation

A phenomenon in which some genetic diseases are observed to appear in more severe forms, either in terms of symptom severity, age of onset, or both, in subsequent generations. Genetic anticipation is common, and has been widely studied, in the trinucleotide repeat disorders (e.g., Huntington’s disease) but it has also been observed in some more common complex diseases with a strong genetic component, including Crohn’s disease and bipolar disorder.

Genetic Drift

A random change in the frequencies of different alleles in a population that is neither deleterious nor beneficial. The term is also used to mean random change as a mechanism of evolution; it is believed to be one of the two most important such mechanisms, with natural selection. Moran, in the reference below, states that random genetic drift is by definition a stochastic mechanism.

Genetic Imprinting

An epigenetic process by which the male and female germline of viviparous species confer specific marks on certain chromosomal regions, leading to the activation of either the paternal or the maternal allele only in somatic cells. Imprinted regions are characterized by increased and specific DNA methylation at particular CpG nucleotides. About 100-200 genes are believed to be imprinted in mammals, including man. Source: Kahl, G, The Dictionary of Gene Technology.

Genetic Profile

Simply, a profile of the variation in one or many genes in an individual or population. In medical genetics, the genetic profile of an individual may be used to predict their likely susceptibility to disease; in evolution, and particularly in microbial evolution, the genetic profile of a population may be used to track its changes over time or space.

Gene Transfer

Horizontal gene transfer is the main mechanism through which genetic material is transferred between species of bacteria. Bacteria are unable to reproduce sexually, so it is the only mechanism other than mutation through which variation is introduced into bacteria. The main methods are the uptake of “naked” DNA from one bacterial species into the chromosome of the other, the transfer of plasmids or transposons, and the transfer of DNA using phages.

Gene Trap

A method of creating large numbers of insertional mutants in the mouse genome, which is both high throughput and cost-effective. A gene trap vector is inserted at random into the genome of mouse embryonic stem cells, simultaneously disrupting the gene at the site of the insertion. A reporter gene in the vector is used to monitor the expression of the trapped gene. The resulting collections of mutant stem cell lines may be used to establish mutant strains of mice via the creation of chimaeras.

Genomic Control

A method of reducing the chance that population heterogeneity will produce spurious associations between genes and disease in the large populations that are necessary for the study of the genetics of complex diseases. By studying multiple polymorphisms scattered through the genome, many of which are known not to be associated with the disease in question, it is possible to estimate population heterogeneity and take it into account.


Genotype

The genetic composition of an individual (of any species), as opposed to the physical features imposed by that genotype (the phenotype). The term may be used either to describe the alleles that are present at a particular locus or to describe the organism’s overall genetic composition. Thus, to take a simple case, the genotype of a cystic fibrosis carrier at the CF locus is different from that of an unaffected individual, although their phenotypes are (in that aspect) indistinguishable.

Germ Cell

A eukaryotic cell that has been produced by meiosis and that is therefore haploid (containing only one copy of each chromosome). Germ cells are egg cells in the female and sperm cells in the male. Most other types of eukaryotic cell cannot pass information to the next generation and are known as somatic cells.


Germline

All cells in an individual that contain genetic material that can be passed on to that person’s children are part of the germline. Self-evidently, this includes the egg and sperm cells, but it also includes the cells from which those cells are derived – the gametocytes. If a mutation occurs in a germline cell (a germline mutation), that change may be passed on to future generations.

Gibbs Sampling

An algorithm for finding patterns (corresponding to, e.g., structural properties or functional motifs) within a set of DNA or protein sequences. One sequence is left out of the set; the other sequences are aligned and the alignment used to produce a scoring matrix. This is matched to the extra sequence and used to predict the pattern; a second sequence is then left out of the resulting complete alignment and the procedure repeated until the matrix can no longer be improved.

Global Alignment

Any sequence alignment technique in which the assumption is made that the (gene or protein) sequences are homologous along their entire lengths. Gaps are inserted into one or both sequences in an attempt to stretch the alignment to cover all the sequences. This method is appropriate for, for example, aligning orthologs from different genomes: it is not appropriate for aligning whole genes with partial ones or cDNA with genomic DNA.
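
A classic global alignment method is the Needleman-Wunsch dynamic programming algorithm, which the entry does not name. The sketch below is a minimal version under simplifying assumptions: a linear gap penalty and illustrative scores (match +1, mismatch -1, gap -2), rather than the substitution matrices and affine gap penalties that production programs generally use.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best_score, aligned_a, aligned_b = needleman_wunsch("ACGT", "AGT")
```

Because the whole of both sequences must be covered, the shorter sequence is padded with a gap character rather than being matched to only part of the longer one.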

Global Free Energy Minimum

The conformation of a molecule (or molecular complex) that has the lowest free energy, as measured (generally) using a molecular mechanics force field. The global minimum is distinct from a large number of local energy minima in different parts of conformational space. Molecular mechanics calculations make the assumption that a molecule or system is most likely to be found in its global minimum, and the difficulty of distinguishing this from the local minima is one of the drawbacks of this methodology.


Glycoconjugate

A generic term used to describe any macromolecule that consists of an oligo- or polysaccharide (i.e., a glycan) covalently bound, or conjugated, to another type of molecule. Glycoproteins and glycolipids are examples of glycoconjugates.


Glycomics

The study of the glycome of a cell or organism. By analogy with genomics and proteomics, the glycome is defined as the complete set of simple and complex carbohydrates that the cell or organism makes. The glycome, like the proteome, is many times more complex than the genome.


Glycoprotein

A glycoprotein is a protein that is glycosylated – that is, one in which one or (more often) more asparagine, serine, or threonine side chains have been covalently linked to sugar moieties. Glycosylation is a posttranslational modification.

Glycosidic Linkage

The covalent link between the protein and carbohydrate parts of a glycoprotein or proteoglycan. There are two main types of such linkages: N-glycosidic linkages, where the oligosaccharide is attached to the amide nitrogen of an asparagine residue, and O-glycosidic linkages, where the oligosaccharide is linked via the side chain hydroxyl group of a serine or threonine residue.

Glycosylation Island

A locus on a eukaryotic chromosome that contains genes that code for proteins involved in glycosylation. The genes are close enough together for the locus to be defined as an operon.


GPI Anchor

One of the three main groups of post-translational glycosylation modifications of protein sequences, the others being O-glycosylation and N-glycosylation. Proteins with GPI (glycosylphosphatidylinositol) anchors are attached to the cell membrane by means of the anchors; they are found in all eukaryotic genomes. It is possible to predict GPI anchor attachment sites from sequence using bioinformatics tools.

Graph Theory

The branch of mathematics that is concerned with the study of graphs. A graph is defined as an array of points (vertices or nodes) that are connected by lines (edges or arcs). In bioinformatics, graph theory may be used, for example, to analyze the expression patterns of a group of genes. If the edges have a direction (e.g., representing the fact that one gene controls the expression of another), the graph is termed a directed graph.
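
A directed graph of this kind can be sketched as an adjacency mapping. The gene names and regulatory relationships below are hypothetical; the function finds every gene that lies downstream of a given regulator by breadth-first search.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical regulatory network: each key controls the genes in its list,
# so each (key, value) pair is a directed edge.
regulates: Dict[str, List[str]] = {
    "geneA": ["geneB", "geneC"],
    "geneB": ["geneD"],
    "geneC": ["geneD"],
    "geneD": [],
}


def downstream(graph: Dict[str, List[str]], start: str) -> Set[str]:
    """Return every node reachable from start by following directed edges."""
    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


targets_of_a = downstream(regulates, "geneA")
targets_of_d = downstream(regulates, "geneD")
```

Because the edges have a direction, geneD is downstream of geneA but geneA is not downstream of geneD.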

Greedy Algorithm

An algorithm that always takes the best immediate, or local, solution while finding an answer. Greedy algorithms find the overall, or globally, optimal solution for some optimization problems, but may find less-than-optimal solutions for some instances of other problems.
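
The difference between the two cases can be sketched with the well-known coin change problem, which is used here only as a generic illustration and is not drawn from the entry. The coin sets are hypothetical.

```python
from typing import List


def greedy_change(amount: int, coins: List[int]) -> List[int]:
    """Make change by always taking the largest coin that still fits."""
    chosen = []
    for coin in sorted(coins, reverse=True):
        while amount >= coin:
            chosen.append(coin)
            amount -= coin
    return chosen


# With this coin set the greedy choice happens to be globally optimal.
optimal_case = greedy_change(30, [1, 5, 10, 25])
# With this one it is not: 3 + 3 would need only two coins.
suboptimal_case = greedy_change(6, [1, 3, 4])
```

In the second case the locally best first step (taking the coin of value 4) rules out the two-coin solution.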

Grid Computing

The Grid has been termed “the second-generation Internet”. It is a vision, which is slowly becoming realized, of networked computers set up so that processing power is as accessible as data is (via the World Wide Web) today. Each computer linked to the grid will be able to “plug in” to a range of services including processing power, communications and storage facilities. The so-called “at home” services in which “spare” PC power is used to solve complex problems, such as Folding@Home in protein folding, are early examples of grid computing. Bioinformatics protocols have already been set up using the grid.


Hairpin

A structural element in either protein or RNA in which a linear chain folds back on itself forming a relatively straight piece of structure with a short loop at one end. In proteins, the two linear regions of chain are beta strands held together with main chain-main chain hydrogen bonds, and the structure is also known as a beta hairpin. In RNA, the linear regions are held together by base pairing, and the structure may also be known as a stem-loop.

Hamming Distance

In information theory, the Hamming distance is the number of positions at which two character strings of equal length differ. This has obvious implications for sequence comparison and pattern matching of genes and proteins. For example, the Hamming distance between the two fragments of protein sequence AFDTGH and VGDTGN is three.
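
The calculation for the two fragments quoted in this entry can be sketched in a few lines; the function name is an assumption for illustration.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance is only defined for strings of equal length")
    return sum(x != y for x, y in zip(a, b))


distance = hamming_distance("AFDTGH", "VGDTGN")
```

The two fragments differ at the first, second, and sixth positions.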


Haploblock

A block of DNA sequence that is usually inherited as a whole, at least in a specific population: that is, a block of sequence where linkage disequilibrium is high. The identification of haploblocks is of great value in identifying and mapping genetic associations for complex diseases.


Haploid

A eukaryotic cell is described as haploid if it contains only one copy of each chromosome. Thus, a human haploid cell contains one copy of each of the 22 autosomes and one sex chromosome (either X or Y), making 23 chromosomes in all. Normal gamete (egg and sperm) cells are haploid.


Haploinsufficiency

The reduction in gene dosage caused by the mutation of one allele of a gene such that the mutated allele cannot be expressed (i.e., the mutant protein is nonfunctional, truncated, or rapidly degraded). The nonmutant allele, however, is expressed normally, resulting in the concentration of that protein in a cell being approximately half the normal concentration. Source: Kahl, G, The Dictionary of Gene Technology.


Haplotype

The specific pattern and order of alleles on a chromosome (a specific strand of DNA). Haplotypes tend to be conserved from generation to generation; in particular, alleles that are located close together on a chromosome are likely to be inherited together.

Haplotype Map

A map of a chromosome showing the location of specific haplotype blocks. A haplotype block is a block of alleles that are normally inherited together: that is, a stretch of DNA showing high linkage disequilibrium, bounded by regions of low linkage disequilibrium. Haplotype mapping may be used for the detection of genes associated with common, multigenic disorders.

Helix Packing

The way in which alpha helices pack together in protein structures, to maximize the attractive interactions between the helices. The helix side chains pack together in a way described as the “knobs into holes” model; interhelical angles of 20 degrees (as in four-helix bundles) and 50 degrees (as in the globin family) are preferred.


Heterochromatin

This term may be used to mean either the part of chromatin that is maximally condensed in interphase nuclei, replicates late in the S phase and is mostly transcriptionally inactive (such as satellite DNA); or, in a different context, the DNA content of the sex-linked chromosomes (such as human X and Y), which are sometimes termed heterosomes or heterochromosomes. Source: Kahl, G, The Dictionary of Gene Technology.


Heteroduplex

Any double-stranded nucleic acid molecule (or duplex) in which the two strands have different origins, whatever those origins are; they may be DNA sequences arising from different genomes or from paralogous genes in the same genome, or they may be an mRNA with its parent DNA. Heteroduplexes may contain loops of single-stranded material lacking a complementary sequence on the opposite strand.

Heterologous Gene

Any gene that has been isolated from one organism and transferred into another (i.e., a transgene). Heterologous genes may be contrasted with homologous genes, which are genes that have been taken out of one organism, manipulated (e.g., by introducing site directed mutations) and then transferred back into the same organism.

Heterozygote Advantage

A case where the disadvantage conferred on homozygotes for a particular allele is balanced by an advantage conferred on heterozygotes. If heterozygotes have sufficient survival advantage over individuals without the allele, the allele will increase in frequency despite poorer survival of those with two copies. The allele for sickle cell hemoglobin is a well known example: heterozygotes (with so-called sickle cell trait) are less susceptible to malaria than those without the trait, which balances the disadvantage of homozygotes suffering from sickle cell anemia.

Hidden Markov Model

A complex, powerful probabilistic prediction technique that has many applications in bioinformatics: for example, predicting gene structure from DNA sequences, protein secondary structure from protein sequences, and classifying genes and proteins into families. The algorithm involves the prediction of hidden states (e.g., whether a particular base is or is not coding) based on observable ones (e.g., the nucleic acid sequence).
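
The prediction of hidden states from observed ones is commonly done with the Viterbi algorithm, which the entry does not name. The sketch below uses a toy two-state model, coding (C) and noncoding (N); every probability in it is invented for illustration and has no biological calibration. It assumes a non-empty sequence and single-character state names.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state path as a string."""
    # best[t][s]: log-probability of the best path that ends in state s at position t.
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return "".join(reversed(path))


# Toy model (all numbers are assumptions): coding regions are taken to be GC-rich.
states = ("C", "N")
start_p = {"C": 0.5, "N": 0.5}
trans_p = {"C": {"C": 0.9, "N": 0.1}, "N": {"C": 0.1, "N": 0.9}}
emit_p = {
    "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}
path = viterbi("ATATGCGC", states, start_p, trans_p, emit_p)
```

For this toy sequence the AT-rich first half is labeled noncoding and the GC-rich second half coding, with a single switch between the two states.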

High Pressure Liquid Chromatography

HPLC is a very commonly used separation technique with many applications in biotechnology, particularly in proteomics. A complex mixture is passed through a matrix material under high pressure, which separates the components of the mixture according to properties such as size, charge, or hydrophobicity. HPLC is used for protein separation, protein and nucleic acid purification, and peptide sequencing.


Histone

Histones are basic proteins that bind DNA and that are used to package long DNA molecules into the nuclei of eukaryotic cells. This must be a complex process as the average length of a human chromosome when extended is 4-5 cm. DNA-histone complexes are termed chromatin. Posttranslational modification of histone sequences has been implicated in imprinting.


Holocentric

During mitosis, the chromosomes of some eukaryotic species bind to the microtubules along their entire length, and move from there to the poles broadside. These chromosomes are termed holocentric, in contrast to monocentric chromosomes, which bind to the microtubules at the centromere and move toward the poles with that leading. The majority of eukaryotic species, including most model organisms, have monocentric chromosomes; however, the chromosomes of the nematode C. elegans are holocentric.


Homeobox

A family of genes involved in the control of development in eukaryotes. They code for transcription factors that have been implicated in the formation and differentiation of many tissue and organ types. Homeobox gene sequences are well conserved throughout the evolutionary history of eukaryotes and they have been used to study mechanisms of evolution.


Homeostasis

Briefly, homeostasis is the maintenance of equilibrium, or resistance to change. It is a feature of living organisms at all levels, from the molecular, through the cellular to the level of the whole organism. In higher eukaryotes, the maintenance of equilibrium is complex and requires the interaction of many different feedback mechanisms. The mechanisms by which the presence of a metabolite can inhibit the enzyme reactions necessary for its production are very simple examples of these.


Homolog

Gene or protein sequences are defined as homologs (or homologous sequences) if and only if they are related by divergent evolution from a common ancestor. Sequence analysis programs determine the degree of identity between sequences; homology can only be inferred from probability, often using functional information.

Homology Modeling

A technique for predicting the structure of a protein from its sequence using one or more structures of homologous proteins. This is the most accurate method of predicting protein structure, and can be as accurate as a medium resolution X-ray crystal structure. It is based on a multiple alignment of the test sequence with the sequences of known structure. Generally, conserved regions of structural or functional importance are copied from one of the known proteins and loops are then modeled separately.


Homoplasy

Any similarity between two or more sequences (gene or protein), or two or more phenotypic traits, that is not an indication of a common evolutionary origin. Convergent evolution, where the same or a similar solution to a particular problem arises independently more than once, is an example of a process that may lead to homoplasy.

Housekeeping Gene

A gene that is constitutively active in all cells of an organism and at most developmental stages, because the protein that it encodes is essential for the maintenance of life (e.g., an enzyme that forms part of a general anabolic or catabolic pathway). The concentration of the proteins encoded by these genes is kept at a fairly constant level within the cell. Genes that are only active under some conditions are termed inducible genes. One classic example is the COX family of enzymes; COX-1 is a constitutive gene, whereas COX-2 is induced as part of the inflammatory response.

Human Genome Project

Trivially, the project to sequence the human genome. It was set up in 1990 and expected to take 15 years; however, thanks largely to the rivalry between the original public collaboration, led by Drs Francis Collins at the NIH and John Sulston at the UK’s Sanger Institute, and the private company Celera Genomics, founded by Craig Venter, it finished 2 years ahead of schedule. The working draft was published in February 2001 and the complete sequence in April 2003. All human genome data is now freely available.


Hybridization

The formation of a nucleic acid duplex from two complementary (or near complementary) single strands, either naturally or induced. Hybridization experiments are used to detect sequence similarities and form the basis of microarray technology. In this technology, which is one of the mainstays of modern bioinformatics, mRNA molecules are detected by hybridization with fragments of complementary cDNA, immobilized on the microarray (or so-called “DNA chip”).

Hydropathy Profile

A graph that plots the average hydrophobicity of a segment of a protein chain against the amino acid at the centre of that segment. The average hydrophobicity is calculated from the amino acid content of the segment using a hydrophobicity scale. In most widely used scales, very hydrophobic amino acids are given high positive scores, so hydrophobic regions of the sequence – which may, for example, represent transmembrane regions – are represented as “peaks” on the hydropathy plot.
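
The sliding-window calculation behind such a plot can be sketched as follows. The entry does not name a particular scale; the Kyte-Doolittle scale is used here as one widely used choice, and the window length and test sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy values (one widely used scale; others exist).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence: str, window: int = 9):
    """Return (centre_index, mean_hydropathy) for each full window along the sequence."""
    if window % 2 == 0 or window > len(sequence):
        raise ValueError("window must be odd and no longer than the sequence")
    half = window // 2
    profile = []
    for centre in range(half, len(sequence) - half):
        segment = sequence[centre - half:centre + half + 1]
        mean = sum(KYTE_DOOLITTLE[aa] for aa in segment) / window
        profile.append((centre, mean))
    return profile


# Hypothetical sequence: a hydrophobic stretch followed by a charged one.
profile = hydropathy_profile("IIIDDD", window=3)
```

Plotting the second element of each pair against the first gives the hydropathy plot; with this scale, positive peaks mark hydrophobic segments.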


Hydrophobic

Hydrophobic literally means “water-hating”. Molecules that are hydrophobic (such as hydrocarbons) are more soluble in oily solvents, such as octanol, than they are in water. Over a third of the amino acids that occur naturally in proteins are hydrophobic; phenylalanine, leucine, and valine are good examples. The fact that these amino acids will be driven into the interior of the protein, away from the solvent, is one of the principal factors driving protein folding.


Hydrophobic Effect

The force that drives hydrophobic molecules or parts of molecules (such as hydrophobic side chains in amino acids) away from solvent molecules and into contact with other hydrophobic molecules. The hydrophobic effect drives the formation of the hydrophobic core of globular proteins and is the principal force driving their folding. The solvent accessible surface of proteins is principally formed by hydrophilic amino acids.


Hydrophobic Moment

Many alpha helices are significantly amphipathic, with hydrophobic amino acids clustered on one side of the helix and polar and charged ones on the other. In protein structures, amphipathic helices will often be found with the hydrophobic face pointing toward the more hydrophobic environment (the interior of a soluble protein or the lipids of a cell membrane). The hydrophobic moment of a helix is a mathematical concept that measures amphiphilicity and that is used in protein structure prediction. It is determined by summing a set of vectors, one per amino acid, each pointing in the direction of that residue's side chain and with a length proportional to its hydrophobicity.
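The vector sum can be illustrated with a short Python sketch. Several details here are assumptions made for the example rather than part of the definition above: successive side chains are taken to be rotated by 100 degrees around the helix axis (the usual figure for an ideal alpha helix), the Kyte-Doolittle scale is used for the hydrophobicities, and the result is divided by the sequence length to give a mean moment per residue.

```python
import math

# Kyte-Doolittle hydropathy scale (an assumed choice of scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_moment(sequence, scale=KYTE_DOOLITTLE, angle_deg=100.0):
    """Mean hydrophobic moment per residue for a helical segment.

    Residue i contributes a vector of length scale[residue] pointing
    at an angle of i * angle_deg around the helix axis.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    delta = math.radians(angle_deg)
    x = sum(scale[aa] * math.cos(i * delta) for i, aa in enumerate(sequence))
    y = sum(scale[aa] * math.sin(i * delta) for i, aa in enumerate(sequence))
    return math.hypot(x, y) / len(sequence)
```

A helix made of a single residue type has a moment close to zero over whole numbers of turns, because the vectors cancel, whereas a sequence that alternates hydrophobic and charged residues with a period of about 3.6 gives a large moment.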

Hypomorphic Allele

An allele that produces a protein that has the same function as the wild type protein but with a reduced level of activity, or, alternatively, an allele that produces the wild type protein at lower levels of expression. There will be serious consequences if the function of that gene product is concentration dependent. Hypomorphic alleles are produced by hypomorphic mutations.

Immobilized pH Gradient

A polyacrylamide support matrix, which contains chemically immobilized carrier ampholytes such that a stabilized pH gradient is generated along the strip. IPGs allow the separation of larger amounts of protein than is possible using conventional isoelectric focusing techniques. Source: Kahl, G, The Dictionary of Gene Technology (Second Edition).


Immunocytochemistry

A technique that uses antibody-antigen binding to demonstrate protein expression and to locate a protein within a cell or tissue. Proteins are located using specific antibodies that are conjugated to dye molecules, and the dye is then located under a microscope. All techniques that use dye stains for molecular localization are collectively termed cytochemistry.


Any method for locating specific protein antigens in cells or tissues using an antibody that is specific for that antigen and is conjugated with a peroxidase. The antibody-antigen complex is detected by, for example, the peroxide-dependent conversion of luminol, which is accompanied by the emission of light.


Imprinting

Loosely, a phenomenon in which the phenotype expressed by an allele differs according to the sex of the parent who passed on that chromosome. In mammals, the term is usually restricted to those cases where the gene from either the maternal or the paternal chromosome is inactivated. The gene in question can be referred to as an imprinted gene. In some cases, the phenotype of a genetic disease will depend on whether the defective gene was inherited from the mother or the father. Imprinting is thought to derive from epigenetic differences between the maternal and paternal alleles.

Imprinting Centre

Imprinting is a phenomenon in which the phenotype expressed by an allele differs according to the sex of the parent who passed on that chromosome. It arises because some genes from either the maternal or the paternal chromosome are normally inactivated during germ cell development. The chromosomal regions that determine this are known as imprinting centres. Deletions of and errors in imprinting centres give rise to inappropriate imprinting and therefore to genetic disorders.


Indegree

In graph theory, the indegree of a node in a directed graph is the number of edges that terminate at that node. This is often applied to the analysis of gene networks derived from microarray experiments, where the relationship denoted by an edge is that one gene affects the transcription of another. A gene with a high indegree is one that is affected by many others, that is, one that is highly regulated. Experiments with yeast microarrays have found that most of the genes with high indegree are involved in metabolism.
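As a small illustration, the indegree of every node can be counted directly from a list of directed edges. The sketch below assumes that each edge is a (regulator, target) pair; the function name and the gene names in the usage example are hypothetical.

```python
from collections import Counter


def indegrees(edges):
    """Count, for each node, the number of edges that terminate at it.

    `edges` is an iterable of (source, target) pairs; nodes that only
    appear as sources are reported with an indegree of 0.
    """
    counts = Counter()
    for source, target in edges:
        counts[source] += 0  # make sure pure regulators are listed
        counts[target] += 1
    return dict(counts)


# Hypothetical regulatory network: geneC is regulated by two other genes.
network = [("geneA", "geneC"), ("geneB", "geneC"), ("geneC", "geneD")]
degrees = indegrees(network)
```

In this toy network `geneC` has the highest indegree (2), so it would be the most highly regulated gene in the sense described above.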


Indel

A shorthand way of expressing “insertion or deletion” in a sequence alignment, expressing the fact that it is impossible to tell (at least without very detailed phylogenetic analysis) whether a gap in an alignment arose from an insertion in one sequence or a deletion in another. In some contexts, the word “indel” may be used synonymously with “gap”.


Index Case

In studies of infectious disease, the index case is the first person to become infected with a disease, and so the source of the outbreak. In studies of genetic disease, the term has been generalized to mean the affected individual through whom an inherited disease-causing mutation is identified in a family.


Inducer

A chemical substance, generally of low molecular weight, that binds to a regulator protein and alters its activity in such a way that the transcription of a specific gene or operon, which has previously been repressed, is reactivated. The generic term “effector” is used to indicate a chemical that binds to a regulator and so controls its activity.

Integrative Biology

Integrative biology is often used as a synonym for systems biology. As such, it can be defined trivially as the computer-based analysis or simulation of molecular data within the context of a system. A system may be as (relatively) simple as a metabolic or regulatory network within a single cell, or it may be a cell, tissue, organ, or organism. Integrative or systems biology may therefore include models of different types and of different levels of precision.


Interphase

In the cell cycle, the period between cell divisions in which the chromosomes are in an extended form within the cell nucleus and cannot be distinguished separately. Interphase is the phase of the cell cycle during which cells grow and carry out their functions. Cytogenetic tests such as FISH are easier if they can be carried out during interphase, as cell culture is not necessary, but chromosomal abnormalities usually cannot be identified in this way.


Introns

Those sequences within a eukaryotic gene that are not retained during pre-mRNA processing and so do not make up the mature message. The 5′ and 3′ ends of each intron contain sequences that signal the initiation and termination of processing (splicing), respectively. Prokaryotic genes generally do not contain introns.

Inverse Protein Folding Problem

The problem of finding sequences that conform to (i.e., that are likely to fold into) a given protein topology. It is so-called because it is the inverse of the more common problem of finding the structure that a particular sequence is likely to fold into.

Inverted Terminal Repeat

Sequence motifs that flank transposons and that are identical or partly identical and present in inverse orientations. They function as recognition sites for the excision of transposons. Source: Kahl, G, The Dictionary of Gene Technology (2nd Edition).

Ion Trapping

A term used for a group of mass spectrometry methods that are able to measure masses of, for example, peptide ions, with very high precision and accuracy. The peptide ions may be created by any standard MS method (e.g., MALDI or ESI); they are focused into the helium-filled ion trap using an electrostatic lens. The positions at which the ions are stably trapped depend on the equipment parameters and their mass/charge ratios, and this enables the m/z ratios to be calculated.

Isobaric Residues

Amino acid residues that have the same molecular mass and that therefore cannot be distinguished in peptide sequencing by mass spectrometry (e.g., leucine and isoleucine) are termed isobaric residues.
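A short Python sketch shows how isobaric residues can be picked out of a table of residue masses. The table holds standard monoisotopic residue masses in daltons; the function name and the idea of a mass tolerance (to mimic limited instrument resolution) are assumptions added for this illustration.

```python
from itertools import combinations

# Monoisotopic masses of amino acid residues, in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def isobaric_pairs(tolerance=0.001):
    """Return residue pairs whose masses differ by at most `tolerance` Da."""
    pairs = []
    for first, second in combinations(sorted(RESIDUE_MASS), 2):
        if abs(RESIDUE_MASS[first] - RESIDUE_MASS[second]) <= tolerance:
            pairs.append((first, second))
    return pairs
```

With a tight tolerance only leucine and isoleucine are reported, since they have identical composition. Widening the tolerance to 0.05 Da also reports glutamine and lysine, which differ by about 0.036 Da and are therefore hard to tell apart on low-resolution instruments.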

Isoelectric Point

The isoelectric point of a protein is defined as that point on the pH scale at which its positive and negative charges balance, so that its net charge is zero. During isoelectric focusing, a protein migrates to the position on a stabilized pH gradient where the pH is equal to its isoelectric point.
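The definition can be turned into a rough numerical estimate: compute the net charge of the sequence as a function of pH and search for the pH at which it crosses zero. The Python sketch below is only an approximation under stated assumptions. The pKa values are typical textbook figures (published sets differ, and real pKa values depend on the local environment of each group), and the function names are chosen for this example.

```python
# Approximate side-chain and terminal pKa values (assumed textbook figures).
POSITIVE_PKA = {"K": 10.5, "R": 12.5, "H": 6.0}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}
N_TERM_PKA = 9.0
C_TERM_PKA = 2.0


def net_charge(sequence, ph):
    """Net charge of a peptide at a given pH (Henderson-Hasselbalch)."""
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))    # free amino terminus
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))   # free carboxyl terminus
    for aa in sequence.upper():
        if aa in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[aa]))
        elif aa in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[aa] - ph))
    return charge


def isoelectric_point(sequence, precision=1e-4):
    """Find the pH of zero net charge by bisection between pH 0 and 14."""
    low, high = 0.0, 14.0
    while high - low > precision:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid   # still positively charged: the pI lies above mid
        else:
            high = mid
    return (low + high) / 2.0
```

Bisection works here because the net charge falls steadily as the pH rises, so there is exactly one crossing point between the two extremes.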


Isozymes

Multiple forms of the same enzyme, which catalyze the same reaction, but may differ in amino acid sequence, physical properties, and regulation. Isozymes may consist of complexes of different, possibly randomly selected, polypeptide chains. They may be separated by conventional biochemical methods.

Iterative Improvement

An algorithmic technique that solves a problem by repeatedly estimating a “slightly wrong” solution, estimating the slight error and subtracting it from the wrong solution to give an improved solution. The process is repeated until the error is smaller than a set value.
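One familiar instance of this pattern is Newton's method for a square root, sketched below in Python. The choice of problem, the function name, and the tolerance are illustrative assumptions; the point is only to show the loop of estimating the error in the current solution and subtracting it.

```python
def iterative_sqrt(a, tolerance=1e-12, max_iterations=100):
    """Approximate the square root of `a` by iterative improvement.

    Each pass estimates the error in the current guess x as
    (x*x - a) / (2*x) and subtracts it; the loop stops once that
    estimated error is smaller than `tolerance`.
    """
    if a < 0:
        raise ValueError("a must be non-negative")
    if a == 0:
        return 0.0
    x = max(a, 1.0)  # a deliberately rough starting solution
    for _ in range(max_iterations):
        error = (x * x - a) / (2.0 * x)
        x -= error
        if abs(error) < tolerance:
            break
    return x
```

Calling `iterative_sqrt(2)` converges in a handful of iterations, because each pass roughly doubles the number of correct digits.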


Karyotype

The complete set of chromosomes in a cell, an individual or a species. The karyotype of a cell or an individual will include gross chromosomal abnormalities (e.g., in chromosomal number). The word karyotyping is used to describe, generically, a number of techniques for determining the karyotype of an individual; FISH is one example. These may be used to detect aneuploidies such as trisomy 21 (Down’s syndrome).


Knockout

An informal term for an animal model (very often, but not invariably, a mouse) in which a single gene has been inactivated (silenced or “knocked out”) by either random or site directed mutagenesis. Phenotypically, gene knockout animals range from normal to nonviable (i.e., embryonic lethal mutations). The term “knock-in” may be used to describe a model in which the function of an inactive gene is restored by mutation.

Lagging Strand

The DNA strand that is discontinuously synthesized in a 5′ to 3′ direction away from the replication fork during DNA replication. It is built from Okazaki fragments that are joined by ligases to form a continuous strand: each fragment is typically 1,000-2,000 nucleotides long in prokaryotes and 100-200 nucleotides long in eukaryotes. Source: Kahl, G, The Dictionary of Gene Technology.

Leading Strand

The DNA strand that is continuously synthesized in a 5′ to 3′ direction toward the replication fork during DNA replication. The opposite strand is the lagging strand, which is synthesized discontinuously. Source: Kahl, G, The Dictionary of Gene Technology.

Leucine-rich Repeat

Short amino acid sequence repeats with a high proportion of leucine residues that are found in tandem arrays in many proteins from different functional families. They are believed to provide a versatile structural framework for the formation of protein-protein interactions, and to be necessary for cytoskeleton morphology and dynamics.


Ligand

A generic term for a nonprotein molecule that must be bound to a protein in order for that protein to function. Ligands are usually, but not always, of low molecular weight. In receptor theory, the term ligand is used to indicate the naturally occurring compound that binds to the receptor in order to elicit a response, as opposed to an agonist or antagonist that is added artificially. However, the term may also be used to indicate, for instance, an enzyme inhibitor.

Ligase Chain Reaction

An in vitro DNA amplification procedure that uses the enzyme DNA ligase to amplify a template. A pair of synthetic oligonucleotides is allowed to anneal to adjacent complementary regions of one strand of the target double stranded DNA, and two other oligos anneal to adjacent complementary regions of the other strand. Each pair of oligos is ligated by DNA ligase, and the ligation product used as a template for subsequent ligation cycles. Source: Kahl, G, The Dictionary of Gene Technology.


LIMS

A laboratory information management system (LIMS) is computer software used for the automatic management of laboratory functions, which could involve anything from the management of samples and standards to invoicing. LIMS as used to control workflow in complex biotechnology laboratories can be considered a branch of bioinformatics, but it is currently only used to any extent in an industrial context, such as managing high-throughput screening in the pharmaceutical industry.


Lineage

Any group of individuals that is derived from a common ancestor may be termed a lineage. Thus, in phylogenetics, the term lineage is synonymous with clade. However, lineage is also used to refer to a family of individual (human or nonhuman) organisms, or, alternatively, a population of differentiated cells derived from an individual precursor (as in “tumour cell lineage”).

Linear Ion Trap

Ion trapping is a mass spectrometry method that is able to measure masses of, for example, peptide ions, with very high precision and accuracy, and in which the peptide ions are focused into the ion trap using an electrostatic lens. A linear ion trap is an enhancement that reduces the number of dimensions of the ion trap from three to two; the ions are trapped radially by a radio frequency containment field, but axially by a static electric field. Linear traps have increased efficiency, sensitivity, and dynamic range.

Linkage Disequilibrium

The occurrence of two or more linked alleles together at a higher frequency than would be expected from their individual frequencies in a particular population. The tighter the genetic linkage between a pair of loci, the higher the degree of linkage disequilibrium that is observed. Source: Kahl, G, The Dictionary of Gene Technology.
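The comparison between observed and expected frequencies can be made concrete with a small Python sketch. It is not taken from the cited source: it uses two commonly quoted measures, the coefficient D (the observed haplotype frequency minus the product of the allele frequencies) and r squared, and it assumes two loci with two alleles each (A/a and B/b) and haplotype counts as input. The function name is hypothetical.

```python
def linkage_disequilibrium(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r_squared) from counts of the four two-locus haplotypes."""
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("at least one haplotype must be observed")
    p_AB = n_AB / total             # observed frequency of haplotype AB
    p_A = (n_AB + n_Ab) / total     # frequency of allele A
    p_B = (n_AB + n_aB) / total     # frequency of allele B
    d = p_AB - p_A * p_B            # excess over the expected frequency
    denominator = p_A * (1 - p_A) * p_B * (1 - p_B)
    r_squared = d * d / denominator if denominator > 0 else 0.0
    return d, r_squared
```

If the alleles are inherited independently, D and r squared are both close to zero; if A is only ever found together with B, r squared reaches 1.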

Linkage Mapping

The process of deriving a linkage map (or genetic map) of a chromosomal region from DNA samples from related and nonrelated individuals, plotting the relative positions of markers based on the frequency of crossovers or recombinations. The genetic distance between two markers – that is, the average number of crossovers during meiosis between the two loci – is given in centiMorgans (cM); one centiMorgan corresponds to a recombination frequency of about 1%.
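The arithmetic behind a map distance can be sketched as follows. The recombination fraction is simply the proportion of offspring that are recombinant for the two markers. The second function applies Haldane's mapping function, which is one common correction for undetected double crossovers; it is an assumption of this example and not something specified above. Both function names are hypothetical.

```python
import math


def recombination_fraction(recombinants, total_offspring):
    """Proportion of offspring that are recombinant for two markers."""
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    return recombinants / total_offspring


def haldane_distance_cm(r):
    """Map distance in centiMorgans from a recombination fraction r.

    Uses Haldane's mapping function, d = -50 * ln(1 - 2r), which assumes
    that crossovers occur independently (no interference).
    """
    if not 0 <= r < 0.5:
        raise ValueError("r must lie in the range [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)
```

For closely linked markers the correction is small: a recombination fraction of 0.01 gives a distance of almost exactly 1 cM, whereas a fraction of 0.1 gives about 11.2 cM rather than 10 cM.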

Lipid Raft

A small area within a cell membrane that is particularly rich in different kinds of lipids: glycolipids, sphingolipids, and cholesterol. Lipid rafts also contain proteins embedded in the membrane using GPI anchors. Many of these proteins are involved in cell signaling, and lipid rafts are also thought to play a role in signaling processes. They are found in both prokaryotic and eukaryotic cells.


Lipoplex

A complex formed between cationic lipids and DNA, used in nonviral vectors for gene therapy. Complexes formed by cationic polymers for the same purpose are termed polyplexes. The DNA in both these types of complexes is protected from degradation by nucleases. Cationic lipids are very useful as components of gene therapy vectors as they are easy to prepare and characterize.

Liquid Chromatography

Any separation technique in which a liquid sample of a complex mixture is passed through a column containing a matrix in such a way that the components of the mixture are separated (e.g., according to their mass). High pressure liquid chromatography (HPLC) has many applications in proteomics and in biotechnology in general.

Liquid Chromatography/Mass Spectroscopy

A reliable method for the separation and identification of proteins, involving linking the output from a liquid chromatographic system to a mass spectrometer. Separation of proteins using liquid chromatography is considered to be competitive with the more widely used 2D-PAGE method. Often, the mass spectrometry step is repeated, leading to protein identification: hence LC-MS/MS.

Local Alignment

Any pairwise sequence alignment technique in which the assumption is made that the (gene or protein) sequences are not homologous along their entire lengths. Local alignment programs report one or more regions of sequence similarity; where multiple regions are reported, these do not necessarily need to be in the same order in both sequences. This method is appropriate for, for example, aligning whole genes with partial ones, cDNA sequences with genomic DNA, or single domains within multidomain proteins.
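A compact sketch of one classic local alignment method, the Smith-Waterman dynamic programming algorithm, is given below in Python. The entry above does not name a particular algorithm, so this choice is an assumption, as are the scoring values (match +2, mismatch -1, gap -2) and the function name. The sketch reports only the single best-scoring region, not multiple regions.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of any local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor of 0 lets an alignment restart anywhere.
            H[i][j] = max(0, diagonal, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a cell with score 0 is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

For example, aligning `"TTTACGTAAA"` with `"GGACGTCC"` picks out the shared region `ACGT` and ignores the unrelated flanking residues, which is the behaviour that distinguishes local from global alignment.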


Locus

The locus of a gene is its location on a chromosome or on a gene map. A single locus may contain several contiguous genes, which are likely to be functionally and/or evolutionarily related: for example, the human cytochrome P450 3A locus on chromosome 7 contains the genes for three different CYP450 isoforms and related pseudogenes.

Locus Control Region

Any DNA sequence that exerts a dominant, activating effect on the transcription of genes in a large chromatin domain (10-100 kb). LCRs prevent the influence of heterochromatic silencing on neighboring sequences. They are therefore used in transgenic experiments as insulator elements that protect themselves and linked genes against the repressive action of heterochromatin. Source: Kahl, G, The Dictionary of Gene Technology.
