LOD Score To Penetrance (Genetics,Genomics,Proteomics,Bioinformatics)

LOD Score

A mathematical description of genetic linkage. The LOD score is defined as the logarithm (to base 10) of the ratio of probabilities that the observed results are produced by linked or unlinked loci. A LOD score of 3 or more indicates that the loci are linked.
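
As a rough illustration of the definition above, the sketch below computes a LOD score for the simplest textbook case: a set of phase-known meioses in which recombinants can be counted directly. The function name and arguments are illustrative assumptions, not part of any standard package, and real linkage software handles far more general pedigrees.

```python
import math

def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """log10 of L(theta) / L(0.5) for phase-known meioses (illustrative only)."""
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must lie in (0, 0.5]")
    non_recombinants = meioses - recombinants
    # Likelihood if the loci are linked with recombination fraction theta
    log_linked = (recombinants * math.log10(theta)
                  + non_recombinants * math.log10(1.0 - theta))
    # Likelihood if the loci are unlinked (theta = 0.5)
    log_unlinked = meioses * math.log10(0.5)
    return log_linked - log_unlinked

# One recombinant in ten meioses, evaluated at theta = 0.1
print(round(lod_score(1, 10, 0.1), 2))
```

In practice the score is evaluated over a range of theta values and the maximum is reported.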

Long-branch Attraction

In phylogenetic analysis, a phenomenon that is thought to bias any attempt to root the universal tree of life toward a eubacterial root. Since, in a universal tree, the eubacterial branch is always the longest one, its selection as the universal root may be explained by an attraction between this branch and the long branch of the outgroup.

Long Terminal Repeat

The repeat sequences at the ends of a retroviral nucleic acid. In proviruses, the upstream LTR functions as a promoter/enhancer and the downstream LTR as a poly-A addition signal. Long terminal repeats are several hundreds of base pairs in length and the repeated sequence is of 4-6 bp. These sequences can be used as elements of integration vectors. Source: Kahl, G, The Dictionary of Gene Technology.

Low Complexity Regions

Regions of DNA or protein sequence that either repeat a single residue or short residue pattern, or else contain a much higher than average percentage of a particular residue type. In gene sequences, low complexity regions are often microsatellites; in protein sequences they may represent, for example, glycine or cysteine rich regions. Low complexity regions are often masked (ignored) in sequence alignments and searches because they may generate spurious (unrelated) matches.


Luciferase

An enzyme, isolated from fireflies and some bacteria, which catalyzes the oxidative decarboxylation of D-luciferin to oxyluciferin. This reaction generates a flash of light, which can be easily monitored. Luciferase is often used as a reporter to monitor protein expression, particularly in transgenic cells.

Luciferin

Any compound that is a natural substrate for the luminescent enzyme luciferase, which is used in proteomics as a reporter for gene expression. Structurally unrelated substrates of luciferase have been isolated from various species including the firefly Photinus pyralis and the ostracode Cypridina. These are generally medium-sized organic molecules containing aromatic heterocycles. Source: Kahl, G, The Dictionary of Gene Technology.

Machine Learning

An area of artificial intelligence in which a computer is allowed to “learn” the pattern and structure of a dataset by analyzing it, and use that to classify data not in the original dataset. Machine learning overlaps significantly with statistical analysis. It is more popular than other areas of artificial intelligence in modern bioinformatics, finding applications in sequence analysis and in the analysis of microarray data.

Main Chain

The backbone of a polypeptide chain, consisting of linked peptide groups and alpha-carbon atoms: thus, the main chain atoms of a single amino acid can be written using standard terminology as -C(O)-C-N(H)-. The peptide group is planar, which restricts the geometric conformation of the main chain. It is the side chains bonded to the alpha-carbon atoms that give the amino acids, and therefore the proteins, their chemical diversity.

Malecot Model

An algorithm for prediction of the decay of linkage disequilibrium with distance, using three parameters. Distance can be measured either in centimorgans or, if it can be assumed that recombination is uniform over the region, in kilobases.
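
The entry does not name the three parameters. A commonly quoted form of the model, assumed here, is rho(d) = (1 - L) * M * exp(-epsilon * d) + L, where M is the association at zero distance, epsilon the rate of exponential decline and L the residual association at large distance. The sketch below simply evaluates that expression; the parameter values are invented for illustration.

```python
import math

def malecot_rho(distance: float, m: float, epsilon: float, l: float) -> float:
    """Expected association at a given distance under the assumed Malecot form."""
    return (1.0 - l) * m * math.exp(-epsilon * distance) + l

# Hypothetical parameter values; distance in kilobases
for d in (0, 50, 500):
    print(d, round(malecot_rho(d, m=0.9, epsilon=0.01, l=0.1), 3))
```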

Markov Model

A probabilistic statistical model used in many bioinformatics applications. One example is its use in sequence analysis, where the probability of each nucleotide or amino acid occurring is dependent on those preceding it. A hidden Markov model (HMM) is a Markov model in which one or more variables are hidden.
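
A minimal sketch of the sequence-analysis case described above, assuming a first-order model in which each nucleotide depends only on the one before it. The function names are illustrative; the transition probabilities are estimated from simple pair counts with no smoothing.

```python
import math
from collections import defaultdict

def transition_probs(seq: str) -> dict:
    """Estimate P(next | current) from adjacent pairs in a training sequence."""
    counts = defaultdict(lambda: defaultdict(int))
    for current, following in zip(seq, seq[1:]):
        counts[current][following] += 1
    return {
        current: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for current, row in counts.items()
    }

def log_probability(seq: str, probs: dict) -> float:
    """Log-probability of seq given its first symbol (unseen pairs raise KeyError)."""
    return sum(math.log(probs[a][b]) for a, b in zip(seq, seq[1:]))

probs = transition_probs("ACACAG")
print(probs["A"])  # C follows A twice, G follows A once
```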

Mass Resolution

The extent to which a mass spectrometer can distinguish between samples of similar mass. Modern Fourier Transform mass spectrometers have very good mass resolution, being capable of identifying peptides and even distinguishing between isotopes.

Mass Spectrometry

In proteomics, mass spectrometry is used to identify proteins from small samples that have been separated by, for example, 2D-PAGE. The proteins are first fragmented into peptides using proteases (typically trypsin). Mass spectrometry involves the ionization of the peptide sample, the separation of the ions by mass-to-charge (m/z) ratio, and the analysis of the separated ions and identification of the protein from its constituent “peptide fingerprint”. Different technologies exist for peptide ionization (e.g., MALDI and ESI), ion separation (e.g., time-of-flight) and mass fingerprint analysis.

Mass Tolerance

A measure of the precision expected from a mass spectrometry experiment, which is used in determining whether, for example, an experimentally measured ion corresponds to a certain peptide. The Mowse scoring algorithm for MS matches states that “each calculated value which falls within a given mass tolerance of an experimental value counts as a match”. Typical mass tolerance values are 2.0 a.m.u. for peptides and 0.8 a.m.u. for fragment ions.
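
The matching rule quoted above can be sketched as follows. This is only the counting step, under the assumption that masses are given as plain lists of numbers in the same units as the tolerance; it is not the Mowse scoring algorithm itself, and the function name is hypothetical.

```python
def count_matches(experimental, calculated, tolerance):
    """Count calculated masses lying within `tolerance` of any experimental mass."""
    return sum(
        1
        for calc in calculated
        if any(abs(calc - exp) <= tolerance for exp in experimental)
    )

experimental = [1000.5, 1500.2]           # measured peptide masses (invented)
calculated = [1001.0, 1498.0, 2000.0]     # theoretical masses for a candidate protein
print(count_matches(experimental, calculated, tolerance=2.0))
```

Tightening the tolerance reduces chance matches but risks missing true ones when the instrument is less accurate than assumed.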

Matrix Assisted Laser Desorption/ionization

A technique used in protein mass spectrometry, as applied to the analysis and identification of separated proteins obtained in proteomics experiments (e.g., 2D-PAGE). The digested peptides are mixed with a large excess of a small, light-absorbing organic compound (the matrix) and dried onto a target plate, where they co-crystallize. Short laser pulses are then fired at the sample; the matrix absorbs the laser energy and is vaporized, carrying the peptides into the gas phase as ions. The peptide ions are accelerated into the mass analyzer, which usually works on the time-of-flight principle.

Maximum Likelihood

A statistical method pioneered by a geneticist, Sir Ronald A. Fisher. It is a method of point estimation that estimates the value of an unobservable parameter as that value that maximizes the likelihood function. The log of the likelihood (the log-likelihood) is an often quoted value.
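
As a small worked illustration, the sketch below estimates a recombination fraction by maximizing a binomial log-likelihood over a grid of candidate values. The grid search and the function names are illustrative assumptions; for this simple model the maximum can also be written directly as recombinants / meioses.

```python
import math

def log_likelihood(theta: float, recombinants: int, meioses: int) -> float:
    """Binomial log-likelihood of theta (constant term omitted)."""
    return (recombinants * math.log(theta)
            + (meioses - recombinants) * math.log(1.0 - theta))

def ml_estimate(recombinants: int, meioses: int) -> float:
    """Return the grid value of theta in [0.01, 0.50] with the highest log-likelihood."""
    grid = [i / 100 for i in range(1, 51)]
    return max(grid, key=lambda t: log_likelihood(t, recombinants, meioses))

print(ml_estimate(2, 10))  # agrees with the closed-form estimate 2/10
```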

Maximum Parsimony

One of three methods commonly used in phylogeny to select the most probable phylogenetic tree relating a set of sequences. In maximum parsimony, the “correct” tree is assumed to be the one that minimizes the number of step changes (i.e., single base or amino acid changes) from the presumed common ancestor that are needed to complete the tree. It generates unrooted trees; it is a very reliable method, but is time consuming and CPU intensive, so best used with small numbers of similar sequences.

Membrane Anchor

A single segment of protein chain that either embeds in, or passes through, a cell or organelle membrane, anchoring the protein to that membrane. Proteins containing membrane anchors must contain either a cytoplasmic or an extracellular functional domain, and may contain both these: the function of the anchor is to attach that protein to a particular point at the surface of the cell or organelle. If such a protein contains more than one domain, the anchor is bound to act as a domain boundary.

Membrane Protein

A protein that passes through a cell membrane, either once or more than once. Apart from proteins that are embedded in the outer membranes of Gram negative bacteria, which are beta-barrels, all membrane proteins contain one or more transmembrane helices. Type I and type II membrane proteins have a single transmembrane helix separating extracellular and cytoplasmic domains. Integral membrane proteins contain helix bundles that are embedded in the cell membrane. Some types of membrane protein contain signal anchor sequences towards their N-terminal ends.

Mendelian Disease

A genetic disease or disorder that is carried by a single gene. Mendelian diseases have a penetrance approaching 100%, that is, all people who carry an abnormal variant of the gene on one or more alleles (depending on the inheritance pattern) will suffer the disease to a greater or lesser extent. Examples include cystic fibrosis (recessive inheritance), Huntington’s disease (dominant inheritance), and hemophilia (sex-linked inheritance).

Metabolic Pathway

The linking of small, biosynthetic molecules via the enzymes that synthesize them in the normal metabolism of any species, to form a network; one widely studied example is the glycolytic pathway through which glucose is broken down to pyruvate with the release of energy in the form of ATP. Information about metabolic pathways is held in databases including KEGG and WIT.

Metabolomics

The study of the metabolome. This is defined, by analogy with “genome” and “proteome”, as the sum total of all metabolites (the “small molecules” that are substrates, intermediates and products in metabolic reactions) within a cell. Like the proteome, the metabolome varies between cell types and, within a cell type, according to developmental stage and environmental conditions.

Meta-Data

Data held within a database that is accessory to and associated with the primary data in the database. For example, the metadata held in a protein sequence database might include gene name and chromosomal location, Gene Ontology annotations, enzyme activity and metabolic pathway involvement. The term metadata may also be used to describe information about an HTML document that is held within the file but not displayed by a browser.

Metaphase

The phase during eukaryotic cell division (mitosis or meiosis) between prophase and anaphase, in which the nuclear membrane has broken down and the daughter chromosomes align in the center of the cell before being drawn toward its ends by the microtubules. This is the stage in mitosis when the chromosome pairs are most clearly visible, so it is useful for cytogenetic analysis.

Meta Server

A server on the Internet that provides access to a number of other servers that provide programs with very similar functions (but probably different methodologies), for example, protein structure prediction, allowing users to compare the results of the different programs. Groups that provide meta servers often do not provide their own methods, but do give analysis of the different methodologies.

Metric Map

Any map of a genome or chromosome, whether defined by linkage, marker or polymorphism, in which the distance between the elements is recorded as well as their order. Linkage disequilibrium maps may be made into metric maps when the linkage disequilibrium is plotted against the physical distance between markers on the chromosome.

MIAME

A series of guidelines (Minimum Information About a Microarray Experiment) set up by the MGED (Microarray Gene Expression Data) Society to enable sharing of microarray data within the gene expression profiling community. The guidelines are designed so one group will be able to reproduce exactly a microarray experiment produced by another. This has been facilitated by the invention of an XML-format markup language, MAGE-ML, for the storage of microarray data.

MIAPE

A series of guidelines (Minimum Information About a Proteomics Experiment) set up by the Human Proteome Organisation (HUPO), by analogy with the MIAME guidelines for microarray experiments, to enable sharing of proteomics data within the community. The guidelines are designed so one group will be able to reproduce exactly a proteomics experiment produced by another.

Microarray

An ordered array of (usually) cDNA fragments, arranged at extremely high density on a solid support, and used for analysis of the mRNA content (transcriptome) of a cell. The experiment is set up so that a signal is generated if the sample contains mRNA molecules that can hybridize to a given cDNA.

Microchimerism

A relatively common phenomenon in which cell lines with different chromosomal compositions are found in one individual. Unlike mosaicism, however, in microchimerism, the cells are derived from two separate individuals. Sometimes cells are exchanged between twin fetuses in the uterus (so-called twin-to-twin transfusion); more often, there is an exchange of cells between mother and fetus during pregnancy, and the mother’s cells may persist throughout the lifespan of the offspring. Microchimerism has been implicated in a number of autoimmune diseases.

Microelectromechanical Systems

Extremely small, microfabricated electrophoresis systems that have been proposed as a potential solution to the remaining cost limitations of genome sequencing. The technology requires multichannel devices and the ability to process samples on the nanoliter scale. Many such devices have short read lengths and these may be most suited to resequencing or genotyping.

Microsatellite

Any short (typically 1-6 bp) tandem repeat in a genome sequence, that is, any short base pattern repeated a number of times. Microsatellites are common throughout eukaryotic DNA. They are often “masked” in sequence searches because a microsatellite match may swamp a match to a distant homolog. Most microsatellites occur in intergenic DNA (so-called “junk DNA”) but occasionally one occurs in a coding region, for example, the (CAG)n motif in the huntingtin gene which is expanded in Huntington’s disease patients.
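
A minimal sketch of how such repeats might be located, assuming a plain uppercase DNA string and an arbitrary threshold of at least four consecutive copies of a 1-6 bp unit. The function name and the threshold are illustrative choices, not a standard definition.

```python
import re

# A 1-6 bp unit (shortest possible) followed by at least three further copies
REPEAT = re.compile(r"([ACGT]{1,6}?)\1{3,}")

def find_microsatellites(dna: str):
    """Return (start, repeat unit, copy number) for each tandem repeat found."""
    hits = []
    for match in REPEAT.finditer(dna):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits

print(find_microsatellites("TTCAGCAGCAGCAGCAGTT"))
```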

Middleware

A type of software that is used as an intermediary between different components; for example, the different components of software that sit between a database user on a client system and the database server. There is a sense in which an ontology can be described as a piece of middleware.

Minisatellite

A short, repetitive, usually GC-rich tandemly arranged DNA sequence. Minisatellites (9-64 bp) are longer than microsatellites (1-8 bp). They occur in all eukaryotic genomes, but are more common in large genomes of complex organisms. Minisatellites tend to show significant length polymorphism.

miRNA

Very small RNA molecules, only 20-25 nucleotides long, that are involved in the regulation of gene expression. They are transcribed from DNA sequences, initially as longer sequences that contain the miRNA and an almost self-complementary sequence that forms a hairpin. The mature miRNA is cleaved out of the precursor sequence by enzymes. It is complementary to part of a coding gene and may anneal to the mRNA, preventing protein translation.

Missegregation

Any process by which chromosomes fail to segregate correctly during cell division, leading to the formation of daughter cells with abnormal and/or missing chromosomes. Chromosomal missegregation often occurs during the division of cancer cells, leading to further errors. The production of abnormal and even extra spindle poles has recently been implicated in this process.

Missense Mutation

A point mutation in coding DNA that changes a codon so that it codes for a different amino acid, giving rise to a single amino acid substitution in the resulting protein. The effect on protein function ranges from negligible to severe, depending on the position and chemical nature of the substituted residue.

Mitosis

The process of cell division that takes place in eukaryotic cells at all times except gametogenesis, and in which the chromosomes are replicated, maintaining chromosome number. Thus, one diploid cell will – in the absence of replication errors – produce a pair of identical diploid cells.

Model-based Analysis

A type of test used in statistical genetics in which the frequency and penetrance of an allele that has been implicated in disease can be estimated with sufficient accuracy to be used in a mathematical model. It is most commonly used for simple genetic diseases; model-free analysis is usually used to model complex diseases. The alternative terms of parametric and nonparametric analyses are regarded as less accurate because some mathematical parameters are generally used in model-free analysis.

Model Organism

An organism that is widely studied by geneticists not because of its pathogenicity or utility but as a genetic “model” for higher organisms. Model organisms are generally common, small, and tractable, and have short life cycles: thus, the nematode worm, Drosophila, Arabidopsis and the common laboratory mouse are all model organisms. By 2004, the genomes of most organisms commonly used as models had been made publicly available.

Module

Domains within proteins may also be referred to as modules. This terminology is most often used of domains that are relatively small, that are present in many protein families with different functions, and that can occur multiple times in the same protein. The immunoglobulin, SH2, and SH3 domains are examples of domains with these properties.

Molecular Clock

The molecular clock hypothesis is the assumption that evolution occurs at the same rate along branches of a phylogenetic tree that emerge from the same node – that is, that branches of a tree that share a common node will be of the same length. It is often a reasonable assumption, particularly if the sequences are closely related, but there are many instances where it cannot be applied because one taxon has undergone more mutations since divergence than another. This hypothesis is built into some phylogeny methods.

Molecular Dynamics

A molecular modeling technique in which the motion of a single molecule or, more often, a molecular system (such as a protein and its ligands in a “bath” of solvent molecules) is simulated. This allows a fuller exploration of conformational space than the related technique of energy minimization. Most often, the molecules are described using a simple molecular mechanics force field: nevertheless, it is very expensive in CPU time. Simulations cover times that are typically of the order of nanoseconds.

Monocistronic

A messenger RNA is defined as monocistronic if it codes for a single polypeptide chain (i.e., a single protein). An mRNA that codes for more than one protein, such as that produced from a single prokaryotic operon, is said to be polycistronic. The majority of eukaryotic mRNAs are monocistronic.

Monophyletic

In phylogeny, a taxonomic group is defined to be monophyletic if all organisms in that group are known to be descended from a common ancestor, and if all the descendants of that ancestor are included in that group. Thus, the genus Homo is classified as monophyletic because all organisms in that genus are believed to derive from a common ancestor, and no other descendants of that ancestor occur outside Homo. Taxonomists prefer to define monophyletic groups if at all possible.

Monte Carlo Algorithm

A type of numerical method that involves statistical simulation using sequences of random numbers. In bioinformatics, Monte Carlo methods are regularly used, for example, in simulating the motion of a macromolecule or complex.

Morgan

A measure for the relative distance between two genes on a chromosome, or for the frequency of recombination between two genetic markers. One Morgan corresponds to that length of chromosome in which, on average, one recombination event occurs each time a gamete is formed. Genetic distances are more usually recorded in centiMorgans (0.01 M). Source: Kahl, G, The Dictionary of Gene Technology.

Mosaicism

A type of genotype in which two cell lines with different chromosomal compositions, derived from a single fertilization, are found in a single individual. Generally, one cell line will be normal and the other contain a chromosomal aberration such as aneuploidy. The resulting phenotype depends on the proportion of abnormal cells as well as the type of aberration, and ranges from normal through minor abnormalities to malformations incompatible with life, posing serious problems in genetic counseling.

Mosaic Protein

A protein that is composed of a number of different domains (or modules). Some mosaic proteins contain very large numbers of domains. The domains that are present in many mosaic proteins are often relatively small, and some domains are found in an enormous range of different proteins with a wide variety of functions. A protein containing only 2-4 domains would not be termed mosaic: it is not a synonym for multidomain.

Motif

A (generally small) sequence of amino acids within a protein sequence, or bases within a nucleic acid sequence, that is characteristic of a particular family, a generic function or a structural pattern. Examples of protein motifs include the helix-turn-helix and the zinc finger, which both bind DNA. The smallest motifs, which can involve only 3 or 4 amino acids, represent potential locations of posttranslational modifications. The main database of protein motifs is PROSITE.

Multiallelic

A gene, or a genetic marker that has more than two forms; in contrast, almost all single nucleotide polymorphisms (SNPs) have only two base variants (e.g., a position may be A or T but not G or C) and are therefore termed biallelic. Genotyping individuals at the sites of multiallelic markers can be very useful in the mapping of genes involved with complex diseases.

Multifocal

A disease that is present at more than one site in the body (i.e., which has more than one focus) is termed a multifocal disease. Bilateral breast cancer, in which the cancer is found in both breasts, is an example. Where a disease, such as breast cancer, is heterogeneous and is only sometimes multifocal, the presence of multifocal disease is one characteristic that can suggest a high genetic component and thence increased risk in blood relatives.

Multigenic Disease

A disease or deleterious trait that is caused by mutations in many genes, rather than, as is the case in monogenic disorders, by a single mutation in one gene. Many common diseases, such as asthma, some types of cancer and some forms of heart disease, are multigenic. The same disease phenotype may have many possible complex genetic causes.

Multiple Alignment

An alignment of more than two gene or protein sequences. Each row in a multiple alignment consists of a single sequence padded by gaps, with the columns highlighting similarity/conservation between positions. An optimal multiple alignment is one with the highest degree of similarity between the sequences. CLUSTAL is a commonly used public domain multiple alignment program.

Multiple Marker Screening

Any test that involves obtaining values from several different markers and combining their results to predict the most likely outcome. The term is often applied to one particular test: the measurement of alpha-fetoprotein and hormone levels in an attempt to detect pregnancies with a high probability of Down’s syndrome or another genetic abnormality. These tests generally give a large number of false-positive results.

Multispecies Conserved Sequence

A sequence (generally a DNA sequence) that is conserved throughout a large number of species, often highly divergent species. Highly conserved regions have been subjected to extremely strong evolutionary pressures, and therefore code for elements that are necessary for the survival of complete clades (e.g., all vertebrates).

Mutagen

Any physical or chemical agent that increases the frequency of mutations in DNA above the spontaneous background level. Mutagenic agents include ionizing radiation, UV irradiation, chemicals (e.g., alkylating agents) and nucleotide base analogs. Mutation may take place in the test tube or in vivo. Source: Kahl, G, The Dictionary of Gene Technology.

Mutagenesis

The process of introducing a change – that is, a mutation – into a DNA sequence. Mutagenesis does, of course, occur naturally, and it may be silent (produce no change in the resulting protein). However, the term is most often used to indicate an artificially induced change. Point mutations are introduced into protein sequences via site directed mutagenesis. The term is also used for methodologies used to create strains of transgenic mice (e.g., gene-trap mutagenesis).

Mutation

Any compositional change in a DNA sequence that is not caused by normal segregation or genetic recombination. Mutations may involve base changes (giving rise to single nucleotide polymorphisms), insertions or deletions; they may occur in coding or noncoding sequence. Mutations in coding sequence will lead to a change in the protein sequence unless the change is to a synonymous codon; mutations in noncoding sequence may have phenotypic consequences if they change the expression patterns of genes.

m/z Ratio

The ratio of the mass of a molecular ion to its charge. This is the quantity by which particles are sorted in a mass spectrometer in the separation experiments that are key to peptide identification in proteomics (most usually by time of flight). It is, therefore, important to work out the charge of each species if its mass – by which it is identified – is to be calculated correctly.
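
The point about charge can be illustrated with a short calculation. The sketch assumes positive-mode ions formed by the addition of z protons, so that m/z = (M + z * 1.007276) / z; the function name is hypothetical, and other adducts would need a different mass correction.

```python
PROTON_MASS = 1.007276  # Da, mass of one proton

def neutral_mass(mz: float, charge: int) -> float:
    """Neutral mass M of a species observed as an [M + zH]z+ ion."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return charge * (mz - PROTON_MASS)

# The same peak assigned two different charge states gives very different masses
print(round(neutral_mass(501.007276, 1), 3))
print(round(neutral_mass(501.007276, 2), 3))
```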

N-glycosylation

One of two types of linkage between the side chain of an amino acid within a protein and a simple sugar or oligosaccharide. The sugar moiety is attached to the protein chain via a covalent bond between N-acetyl-D-glucosamine and a nitrogen atom in the side chain of an asparagine (N) residue, which must lie in the context of the simple motif N-X-S/T.
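
A minimal sketch of scanning a protein sequence for the N-X-S/T motif with a regular expression. It assumes one-letter amino acid codes and, going slightly beyond the entry, excludes proline at position X, as is common practice in motif databases. A match marks only a potential site; the function name is illustrative.

```python
import re

# Lookahead so that overlapping candidate sites are all reported
SEQUON = re.compile(r"(?=N[^P][ST])")

def find_sequons(protein: str):
    """Return 0-based positions of asparagines in an N-X-S/T context (X not P)."""
    return [match.start() for match in SEQUON.finditer(protein)]

print(find_sequons("MNGSANPTNNTS"))
```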

Needleman-Wunsch Algorithm

A dynamic programming method of aligning pairs of sequences that produces a global alignment between the whole sequences. The alignment score is defined as the sum of the scores at each individual position; the sequences are moved and gaps introduced to maximize the total score along the sequence lengths. Gap penalties are often, but not always, applied to gaps at the end of sequences. This method is used in, for example, the EMBOSS global alignment program, NEEDLE.
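
The procedure can be sketched in a few lines. This version assumes a simple match/mismatch score and a linear gap penalty that is also applied to end gaps; the scoring values are arbitrary illustrations, and it is not a reimplementation of the EMBOSS program.

```python
def needleman_wunsch(a: str, b: str, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned a, aligned b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: a aligned against gaps
        score[i][0] = i * gap
    for j in range(1, m + 1):          # first row: b aligned against gaps
        score[0][j] = j * gap

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

print(needleman_wunsch("ACGT", "AGT"))
```

Both time and memory grow with the product of the two sequence lengths, which is why the method is applied to pairs rather than to database searches.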

Network

Complex relationships between entities may be represented using networks of connections. Each entity (a gene or protein) is represented as a point, or node, in the network, and the relationships between them are represented by lines joining nodes (known as edges). Graph theory is used to classify and cluster the nodes in a network, discovering relationships that may not be visible from a simple examination of the raw data.

Neural Network

A programming methodology often used in bioinformatics for predicting features from sequence data (e.g., predicting genes from DNA sequences or protein structure from amino acid sequences). A neural network program consists, simply, of a series of “neurons” that read data (e.g., sequences) and pass information about that data as signals to other neurons; the final neuron makes the prediction. Sequences containing the known features must first be used to “train” the network.

Node

In graph theory, the objects are known as nodes, and they are connected by lines indicating relationship; these are edges. The edges may or may not have directionality, depending on the relationship modeled. Graph theory has many applications in bioinformatics, where it is used to cluster the most closely related objects (typically genes or proteins) together. Nodes are most often used to represent individual genes or proteins, connected by relationships such as “is coexpressed with” (for genes) or “interacts with” (for proteins).

Nonsense Mutation

Any mutation that converts a sense codon (coding for an amino acid) into a stop codon (TAA, TAG, or TGA in the standard code), or, conversely, a stop codon into a sense codon. This leads to the production of a polypeptide chain that is either truncated or extended, and, consequently, the function of the protein will be either severely limited or completely abolished.

Nonsynonymous

A base change (mutation) is described as nonsynonymous if it occurs in coding DNA and gives rise to a change in the amino acid that is coded for: thus, a change from T to A that changes the codon CAT to CAA, and thus changes the amino acid histidine to glutamine in the resulting protein, is nonsynonymous, whilst a T-A change that changes CCT to CCA is synonymous as both codons code for proline. Nonsynonymous changes are self-evidently more important in evolution than synonymous ones; they are also less common in coding DNA.
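
The two codon changes used as examples in this entry can be checked with a short sketch, assuming the standard genetic code and DNA codons written with T rather than U. The table is built from the conventional TCAG ordering; the function name is illustrative.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code: codon -> one-letter amino acid ("*" marks a stop codon)
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def classify_change(before: str, after: str) -> str:
    """Classify a codon change by comparing the encoded amino acids."""
    same = CODON_TABLE[before] == CODON_TABLE[after]
    return "synonymous" if same else "nonsynonymous"

print(classify_change("CAT", "CAA"))  # histidine to glutamine
print(classify_change("CCT", "CCA"))  # proline to proline
```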

Normalization

The equalization of the concentrations of transcripts present in a cell at extremely different levels, balancing the unequal representation of the messages in a cDNA library (which often vary by more than 5 orders of magnitude) by reducing the number of highly expressed mRNAs and enriching rarely expressed message.

Northern Blotting

A gel blotting technique in which RNA molecules, separated according to size by agarose or polyacrylamide gel electrophoresis, are transferred directly to a filter by electric or capillary forces. Single-stranded nucleic acids may be fixed to the filter by baking and are thus immobilized. Hybridization of single-stranded probes to the immobilized RNAs allows the detection of individual RNAs out of complex mixtures.

Nuclear Magnetic Resonance

An analysis technique in which molecules are identified, and molecular structures detected, by monitoring signals generated by certain atomic nuclei (those of nonintegral spin, most often protons) in oscillating high magnetic fields. Two-dimensional nuclear magnetic resonance (2D NMR) is often used for determining protein structures. This technique has the advantage of generating the structures of proteins in solution, but the disadvantage that it can only be used with relatively small proteins.

Null Hypothesis

In statistical analysis, a hypothesis is chosen at the beginning of an experiment; the objective is to collect enough data to prove or disprove that hypothesis. The null hypothesis states that a condition (for example, that a given proportion of the data has a particular value or range of values) will, or will not, be met; the objective of the test is to accept or reject that hypothesis.

O-glycosylation

One of two types of linkage between the side chain of an amino acid within a protein and a simple sugar or oligosaccharide. The sugar moiety is attached to the protein chain via a covalent bond between N-acetyl-D-galactosamine and the hydroxyl group of a serine or threonine residue in most proteins, or of a (nonstandard) hydroxylysine residue in the protein collagen.

O-mannosylation

The transfer of a mannose residue from dolichyl-activated mannose to serine or threonine residues of secretory proteins, catalyzed by protein O-mannosyltransferases. Mannosylation was first observed in fungi, but mannosyltransferase orthologs have now been identified in the genomes of higher eukaryotes.

Object Oriented Programming

A programming paradigm, adopted in languages such as C++ and Java (both influenced by the procedural language C), in which data types are defined as objects. An object includes both data and the operations (functions) that can be applied to it. Most programming languages frequently used in bioinformatics are fully object oriented; one exception is the popular and easy to learn scripting language, Perl.

Obligate

Able to live only in a particular set of conditions; that is, an obligate parasite is unable to survive and reproduce outside its host. The bacterium Chlamydia trachomatis is an obligate intracellular human pathogen that is unable to reproduce outside human cells.

Oligomer

A relatively small number of molecular units joined or associated together. These may be covalently bonded, as in nucleotides (to form an oligonucleotide) or amino acids (to form an oligopeptide or, simply, a peptide) or noncovalently associated, as in several protein chains forming a functional protein complex. Associations of two, three and four units are termed dimers, trimers and tetramers respectively. The DNA double helix is, therefore, a noncovalently bonded dimer.

Oligonucleotide

A short segment of nucleic acid, which may be single- or double-stranded. The term is generally used for segments containing up to 100 nucleotides or base pairs. The short form “oligo” is almost always used informally by experimental molecular biologists. Oligos may consist of deoxy- or ribonucleotides, or of a mixture of the two.

Oligosaccharide

A molecule made up of a relatively small (say 10-100) number of sugar units (=monosaccharides), joined together by condensation reactions to form linear or branched chains. Oligosaccharides are frequently attached to protein molecules to form glycoproteins. Larger numbers of monosaccharides linked together form polysaccharides (also termed complex carbohydrates).

Oncogene

Genes that control normal cellular growth and development are known as proto-oncogenes. In normal cells, these are kept under tight control, so growth and development signals are sent only when required. When a proto-oncogene is mutated (by point mutation or simply by gene amplification), its protein product can become permanently activated, so that growth and division signals are always sent. Uncontrolled cellular growth and development is the hallmark of cancer; the altered proto-oncogene is known as an oncogene.

Ontology

In computer science and allied fields, the word ontology – defined philosophically by Aristotle as “the science of being qua being” – is used to describe a strict conceptual schema of data or concepts within a given domain. This has been applied to the derivation of structured, consistent vocabularies in different areas of knowledge, including the life sciences. The best-known ontology in the molecular life sciences is undoubtedly the Gene Ontology (http://www.geneontology.org).

Open Reading Frame

Trivially, a region of genome sequence that starts with an initiation (START) codon and ends with a termination (STOP) codon, and so could potentially be translated into protein. A scan of a genome sequence for long ORFs is the first and easiest stage of gene prediction. In practice, the situation is much easier in prokaryotic genomes than in eukaryotic genomes, which are complicated by the extreme length of some genes, the presence of introns, and the necessity of identifying splice sites.
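The simple scan for ORFs described here can be sketched in a few lines of Python. This is a minimal illustration only, assuming the standard start codon ATG and stop codons TAA, TAG and TGA, and searching the three forward-strand frames only; the function name `find_orfs` and the `min_codons` threshold are hypothetical, and real gene predictors do considerably more.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    start is the index of the A of ATG, end is the index just past the
    stop codon, and min_codons counts codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one whole codon at a time.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


dna = "CCATGAAATTTGGGTAACC"
for start, end, frame in find_orfs(dna):
    print(frame, dna[start:end])
```

A fuller scan would also search the reverse complement, giving six reading frames in total.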

Open Source

A software product that is not only deliberately given away free, but where the code is made freely available and where modification is not only allowed but encouraged, is termed open source software. It may be protected by agreements that are analogous to the way copyright laws work in the commercial sector; one of these is known as “copyleft”. Examples of open source software products include the Linux operating system for PCs and the general bioinformatics package EMBOSS (European Molecular Biology Open Software Suite).

Operator

The stretches of prokaryotic genome sequence, adjacent to the promoter regions of genes, that regulate gene expression by binding proteins. The first regulatory mechanism to be understood was that of the lactose operon: here, it is the binding of the lac repressor to the operator region that prevents the attachment of RNA polymerase and therefore gene expression.

Operon

Operons are found mainly in prokaryotes. They are series of genes, normally functionally related, that are adjacent on the bacterial chromosome, are under the control of a single promoter, and are transcribed into a single, polycistronic mRNA that is translated into the constituent proteins.

ORESTES

Normally, ESTs are derived from the 3′ and the 5′ ends of cDNAs, and fragments from the centre of transcripts are underrepresented in EST libraries.

ORESTES is a novel technique for generating ESTs that preferentially amplifies the central portion of transcripts, and which can therefore be used to add many novel sequences to EST databases. It involves the amplification of the expressed gene transcripts by reverse transcription-PCR using arbitrarily chosen primers.

Origin of Replication

The sequence or region on a DNA strand or chromosome where replication begins – that is, the replication-initiation focus. In eukaryotes, the segment of DNA that is under the control of one replication-initiation focus, and which therefore acts as an autonomous unit during replication, is termed a replicon.

Orphan Gene

A gene that does not have any known orthologs in any other species – that is, a gene that is, as far as is known, found in one species only. Generally, the function of an orphan gene is unknown. The term is also applied to open reading frames (ORFs) that are not (yet) validated genes, hence the alternative term ORFan. Of course, a gene that is thought to be an orphan may not be one, either because its homologs are distant enough to be undetectable at the sequence level or because all its orthologs are in genomes that have not yet been sequenced.

Ortholog

Two homologous (evolutionarily related) genes are defined to be orthologous (i.e., they are orthologs of each other) if they are essentially the same gene, with the same function, in different organisms. Thus, human hemoglobin, mouse hemoglobin, and sperm whale hemoglobin are orthologs.

Outdegree

In graph theory, the outdegree of a node in a directed graph is the number of edges that start at that node. This is often applied to the analysis of gene networks derived from microarray experiments, where the relationship denoted by an edge is that one gene affects the transcription of another. A gene with a high outdegree is one that affects many others, that is, which is a central regulator of the network. Experiments with yeast microarrays have found that most of the genes with high outdegree are transcriptional regulators.
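As a small illustration, the outdegree of every node can be counted directly from a list of directed edges. The sketch below assumes each edge is a (regulator, target) pair; the gene names are hypothetical placeholders, not real data.

```python
from collections import Counter


def outdegrees(edges):
    """Count outgoing edges per node; edges are (source, target) pairs."""
    counts = Counter()
    for source, target in edges:
        counts[source] += 1
        counts[target] += 0  # make sure pure targets appear with outdegree 0
    return dict(counts)


# Hypothetical network: regA affects three genes, geneB affects one.
edges = [
    ("regA", "geneB"),
    ("regA", "geneC"),
    ("regA", "geneD"),
    ("geneB", "geneC"),
]
degrees = outdegrees(edges)
hub = max(degrees, key=degrees.get)
print(degrees, hub)
```

In this toy network `regA` has the highest outdegree and would be the candidate central regulator.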

Outgroup

A sequence (or group of sequences) included in a phylogenetic analysis precisely because it is known to be more distantly related to the other sequences than any of them are to each other. The outgroup will diverge from the other sequences near the root of a rooted tree. Outgroups are useful as external references, and including one may lead to more accurate ordering of the other sequences.

Pair Potential

In molecular mechanics calculations or (much more often) molecular dynamics simulations, parameters to be used in equations defining the non-bonded interactions between different types of atom. Each pair of atom types will have a different pair potential. The parameters are inserted in a standard equation defining nonbonded interactions (e.g., the Lennard-Jones or the Buckingham potential equation).
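To make this concrete, the sketch below evaluates the Lennard-Jones 12-6 potential, V(r) = 4ε[(σ/r)^12 − (σ/r)^6], using a well depth ε and a contact distance σ looked up for each pair of atom types. The parameter table and its numerical values are purely illustrative assumptions and are not taken from any real force field.

```python
def lennard_jones(r, epsilon, sigma):
    """Lennard-Jones 12-6 energy at separation r (same length unit as sigma)."""
    if r <= 0:
        raise ValueError("separation r must be positive")
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


# Hypothetical pair-potential table: (epsilon, sigma) per pair of atom types.
# The values are made up for illustration only.
PAIR_PARAMS = {
    ("C", "C"): (0.40, 3.4),
    ("C", "O"): (0.50, 3.2),
    ("O", "O"): (0.60, 3.0),
}


def pair_energy(type_a, type_b, r):
    """Look up the parameters for an atom-type pair and evaluate the energy."""
    key = tuple(sorted((type_a, type_b)))  # order of the two types is irrelevant
    epsilon, sigma = PAIR_PARAMS[key]
    return lennard_jones(r, epsilon, sigma)


print(pair_energy("O", "C", 3.2))  # energy is zero at r == sigma
```

The energy is zero at r = σ and reaches its minimum of −ε at r = 2^(1/6)·σ, which is a convenient check on any implementation.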

Palindrome

In the study of language, a palindrome is a word or sentence that reads the same forward as backward, but that nevertheless makes sense: one example is the word MADAM. In genetics and genomics, the word is used analogously to mean a sequence of DNA where identical sequences run in opposite directions, so each strand reads the same in the 5′ to 3′ direction. Palindromic DNA sequences can be the target for DNA binding proteins and they often occur in regulatory regions of DNA.
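In code, a DNA sequence is palindromic in this sense when it equals its own reverse complement. The following minimal sketch assumes an unambiguous A/C/G/T alphabet; the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read in its own 5' to 3' direction."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def is_palindromic(dna):
    """True if both strands read the same in the 5' to 3' direction."""
    dna = dna.upper()
    return dna == reverse_complement(dna)


print(is_palindromic("GAATTC"))   # the EcoRI recognition site
print(is_palindromic("GATTACA"))
```

Note that the linguistic palindrome MADAM is a plain string reversal, whereas the DNA test also complements each base.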

PAM Matrices

One of the two most widely used sets of matrices that hold data on the evolutionary distance between amino acids (i.e., the probability that a substitution of one amino acid by another will be accepted), the other being the BLOSUM matrices. PAM stands for “point accepted mutation”, although “accepted point mutation” would be clearer. The PAM 1 matrix is the substitution matrix for a situation where, on average, one accepted mutation has occurred per 100 amino acids. The most widely used matrix is PAM 250, which corresponds to approximately 20% identity between the sequences.

Panmixis

Simply, random mating – that is, sexual reproduction where the choice of mates is not influenced by their genotypes. The word is derived from the Greek word mixis (mixture).

Paralog

Two homologous (evolutionarily related) genes are defined to be paralogous (i.e., they are paralogs of each other) if they have different (although almost always related) functions. They may or may not occur in the same genome: paralogs that occur in the same genome will have evolved through gene duplication. Thus, human hemoglobin and human myoglobin are paralogs, but so are human myoglobin and sperm whale hemoglobin.

Pattern Recognition

Trivially, any tool or technique for recognizing patterns in sequences. The technique of pattern recognition is most often applied to protein function detection, as short groupings and/or more complex patterns of amino acids often have implications for the function of the protein. The database PROSITE contains data on many hundreds of amino acid patterns that have been associated either with protein families, functions, or (for the shortest patterns) posttranslational modifications.
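As an illustration of how such a pattern can be searched for, the sketch below converts a pattern written in PROSITE syntax into a Python regular expression and reports where it matches. It handles only the basic syntax elements (`x`, `[...]`, `{...}`, repeat counts in parentheses, and the `<` and `>` anchors); the function names are hypothetical and this is not the code used by PROSITE itself. The example pattern `N-{P}-[ST]-{P}` is the PROSITE N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE pattern into a regular expression string."""
    pattern = pattern.rstrip(".")               # patterns end with a period
    parts = []
    for element in pattern.split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # excluded residues
        element = element.replace("(", "{").replace(")", "}")   # repeat counts
        element = element.replace("x", ".")                     # any residue
        element = element.replace("<", "^").replace(">", "$")   # terminus anchors
        parts.append(element)
    return "".join(parts)


def find_sites(sequence, pattern):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    regex = "(?=" + prosite_to_regex(pattern) + ")"
    return [m.start() for m in re.finditer(regex, sequence)]


print(prosite_to_regex("N-{P}-[ST]-{P}."))
print(find_sites("AANGSAA", "N-{P}-[ST]-{P}."))
```

Very short patterns such as this one match by chance in many unrelated proteins, so a hit is a suggestion to investigate rather than proof of function.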

Pedigree

Simply, a chart or diagram showing the relationships within a human (or model organism) family that can be used to study the inheritance pattern of an allele, marker, or disease. Large pedigrees over many generations and within relatively isolated populations, such as those studied in deCODE’s Icelandic genome project, have been used to map the loci and alleles involved in complex diseases.

Penetrance

A gene is said to have high penetrance if the properties that it codes for will always or almost always be present in the phenotype, and low penetrance if the extent to which they are observed in the phenotype depends more on environmental variables or on other genes. Thus, the CFTR gene has higher penetrance than the BRCA1 gene, because disease-causing mutations in CFTR will almost always cause cystic fibrosis, whereas one in BRCA1 only increases the lifetime chances of contracting certain cancers.
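Penetrance is commonly quantified as the proportion of individuals carrying a given genotype who actually show the associated phenotype. The sketch below computes that proportion; the function name and the counts in the usage example are hypothetical and are not real clinical data.

```python
def penetrance(affected_carriers, total_carriers):
    """Proportion of carriers of a genotype who show the phenotype."""
    if total_carriers <= 0:
        raise ValueError("total_carriers must be positive")
    if not 0 <= affected_carriers <= total_carriers:
        raise ValueError("affected_carriers must lie between 0 and total_carriers")
    return affected_carriers / total_carriers


# Hypothetical counts, for illustration only.
high = penetrance(98, 100)   # nearly every carrier is affected
low = penetrance(30, 100)    # most carriers are unaffected
print(high, low)
```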
