A posttranslational modification of cysteine residues in proteins, involving the addition of a nitroso (nitric oxide) group to the free thiol. Like phosphorylation, it is thought to represent a mechanism for reversible posttranslational regulation of protein activity and consequently of cellular function.
A network can be described as scale free if the number of connections at each node is distributed very unevenly, that is, if there are a small number of very highly connected nodes. In these networks, the probability that a given node is connected by a given number of connections is determined by a power law. The highly connected nodes are termed the hubs of the network. Many examples of scale free networks can be taken from bioinformatics and other disciplines. Gene networks can be scale free if they contain genes with particularly high numbers of connections.
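As a minimal sketch of the idea, the number of connections per node can be counted from an edge list and the most highly connected nodes flagged as hubs. The gene names, the edge list and the hub cutoff below are invented for illustration; a network this small cannot demonstrate a power law, it only shows how the degree distribution is tabulated.

```python
from collections import Counter

def node_degrees(edges):
    """Count the number of connections (degree) of each node in an undirected edge list."""
    degrees = Counter()
    for a, b in edges:
        degrees[a] += 1
        degrees[b] += 1
    return degrees

def degree_distribution(degrees):
    """Map each degree k to the fraction of nodes that have exactly k connections."""
    counts = Counter(degrees.values())
    total = len(degrees)
    return {k: n / total for k, n in sorted(counts.items())}

# Hypothetical toy gene network in which geneA is linked to every other gene.
edges = [("geneA", "geneB"), ("geneA", "geneC"), ("geneA", "geneD"),
         ("geneA", "geneE"), ("geneA", "geneF"), ("geneB", "geneC")]

degrees = node_degrees(edges)
# The cutoff of 4 connections is an arbitrary assumption, not a standard value.
hubs = [node for node, k in degrees.items() if k >= 4]

print(degree_distribution(degrees))
print(hubs)
```

In a real analysis the distribution would be plotted on log-log axes, where a power law appears as an approximately straight line.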
A term used in several disciplines within computer science to mean a model. For example, in database theory, the schema is the structure of the database, and in XML, the XML schema defines the structure of the XML documents. The term may also be used to mean a specialized type of ontology.
The RNA component of small cytoplasmic nucleoproteins (scRNPs), which are found in the cytoplasm of eukaryotic cells. These scRNPs are RNA-protein complexes that are involved in the splicing of nuclear precursor RNA after its transport into the cytoplasm. They are released from the mRNA before it is translated into protein.
The most commonly used method for separating proteins by molecular mass; one of the two methods routinely used in two-dimensional protein separation. The protein mixture is loaded with a detergent, most often sodium dodecyl sulphate (SDS); this denatures the proteins and confers on them a negative charge that is proportional to their molecular mass. Migration of SDS-protein complexes on polyacrylamide gels will therefore depend largely on molecular mass.
A signal transduction protein is termed a second messenger if it passes a signal on; that is, second messengers relay signals received by cells following the activation of cell surface receptors to their target molecules. Second messengers also amplify the strength of the signal received. There are three main classes of second messenger: cyclic nucleotides, inositol triphosphate and diacyl glycerol, and calcium ions.
The second level in the hierarchy of protein structure, describing short stretches of the polypeptide chain that have regular backbone geometry and patterns of main chain-main chain hydrogen bonding. There are two common types of secondary structure, the alpha helix and the beta strand: beta strands associate into sheets, and these sheets – and certain types of tight turn between beta strands – are sometimes included in this structural category.
A duplication of a relatively large segment of DNA within a genome sequence is termed a segmental duplication. A high degree of sequence identity within the duplicate regions indicates that the duplication event occurred relatively recently in evolutionary history. The human segmental duplication database contains all duplicates that are greater than 1 kb in length and share greater than 90% sequence identity.
The process by which chromosomes separate during meiosis and mitosis and migrate toward opposite ends of the cell is termed segregation. Segregation occurs during late metaphase and anaphase; the daughter chromosomes are drawn toward the ends of the cell by the microtubules.
Any change in the environment of a species that leads to some variants being more favored than others, and which therefore leads to those variants surviving and reproducing in greater numbers, is a selective pressure on that species. A classic example, often described in elementary texts, is the rise in pollution during the Industrial Revolution, which increased the chances of survival of dark pigmented moths.
An evolutionary event in which a favourable mutation is incorporated into the genome of a species so that it becomes the dominant variant so quickly that alleles that are linked to that mutation also become incorporated into the genome. It can therefore be difficult to identify the allele that is the original target of the selective sweep. The linked genes can be said to have been “hitch-hiked” into the genome.
A value that indicates how successful a test is in selecting mismatches (=negatives) from a sample set. It is calculated as the ratio of true negatives (TN; samples that do not have the feature tested for and also fail the test) to all those that do not have the feature (false positives and true negatives); thus specificity = TN/(TN+FP). Specificity ranges between 0 and 1; a perfect test will have a specificity of 1.
A value that indicates how successful a test is in selecting matches (=positives) from a sample set. It is calculated as the ratio of true positives (TP; samples that have the feature tested for and also pass the test) to all those that have the feature (true positives and false negatives); thus sensitivity = TP/(TP+FN). Sensitivity ranges between 0 and 1; a perfect test will have a sensitivity of 1.
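The two formulas, specificity = TN/(TN+FP) and sensitivity = TP/(TP+FN), can be illustrated with a short sketch. The counts used here are made up for the example.

```python
def sensitivity(tp, fn):
    """Fraction of samples that have the feature and also pass the test: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of samples that lack the feature and also fail the test: TN / (TN + FP)."""
    return tn / (tn + fp)

# Hypothetical evaluation: 100 samples with the feature, 200 without it.
tp, fn = 90, 10   # 90 of the 100 true matches are detected
tn, fp = 160, 40  # 160 of the 200 true mismatches are rejected

print(sensitivity(tp, fn))
print(specificity(tn, fp))
```

With these counts the test detects 90% of the matches and rejects 80% of the mismatches; both values would be 1 for a perfect test.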
In proteomics experiments, the first step is very often the separation of a protein mixture by mass and/or charge. The protein sample must be dissolved in a compound, such as a gel, before separation can take place; this is the separation matrix. Polyacrylamide is most commonly used (hence the term 2D-PAGE, or 2-D polyacrylamide gel electrophoresis) but others, including polyethylene oxide and hydroxycellulose, have also been used.
A pattern of residues within a protein that is associated with, for example, a particular functionality or a type of posttranslational modification. Sequence signatures may be long or short, simple or complex, and based on regular expressions, weight matrices, profiles or hidden Markov models. There are many databases containing information on sequence signatures, some of the best known being PROSITE, Pfam, PRINTS and SMART, and there is one meta-database, InterPro, that collects together information from these and other source databases.
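A regular-expression signature can be sketched as follows, using the well-known PROSITE N-glycosylation pattern N-{P}-[ST]-{P} (asparagine, any residue except proline, serine or threonine, any residue except proline). The translation to a Python regular expression and the test sequence are illustrative assumptions; this is not how any particular database tool is implemented.

```python
import re

# PROSITE-style pattern N-{P}-[ST]-{P} written as a Python regular expression.
# The lookahead (?=...) lets overlapping occurrences be reported as well.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")

def find_signature(sequence, pattern=N_GLYCOSYLATION):
    """Return (0-based start position, matched residues) for every occurrence."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence)]

# Hypothetical protein fragment, invented for the example.
fragment = "MKNGSAANPTLNLTQ"
print(find_signature(fragment))
```

The asparagine followed by proline (NPTL) is correctly skipped, because the pattern excludes proline at the second position.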
The alignment of the sequence of one protein with the structure of another, which may be either homologous or analogous. Sequence-structure alignment is computationally a much harder technique than sequence-sequence alignment. This method is used prior to protein structure prediction only when no structures of proteins that are clearly recognizable as homologs at the sequence level are available.
The gap between the number of proteins of known sequence and the number of proteins of known structure, that is, the number of known proteins with no experimentally determined tertiary structure. This gap is still growing, as structure determination is not keeping pace with the number of translated gene sequences coming out of genome projects. It can be narrowed by using homology modeling to predict the structures of proteins that are homologous to proteins of known structure.
A conceptual term used to describe the complete range of protein sequences and structures that have been generated by evolution. Structural proteomics (or structural genomics) programs that aim to find unknown folds or solve the structures of unknown sequences without taking much account of the known or predicted function of the proteins are described as searching or filling protein sequence-structure space.
Serial Analysis of Gene Expression
A high-throughput technique for the simultaneous detection and analysis of almost all the genes that are expressed in a given cell at a given time. It is based on the isolation of a short sequence tag, a so-called “SAGE tag” or “diagnostic tag”, from a defined location within the transcript. This tag contains sufficient sequence information to uniquely identify the transcript. The tags are concatenated into a single DNA molecule for sequencing, which aids rapid identification of the tags and therefore the genes from which they are derived. The software used for gene identification with SAGE is also able to determine expression levels.
The determination of the sequence of bases in a complete genome by a method that involves the fragmentation of the target genome or its chromosomes by physical or enzymatic means, the cloning and sequencing of the resulting fragments and the reconstruction of the complete sequence by ordering the fragments. It was used in the publicly funded Human Genome Project. Source: Kahl, G, The Dictionary of Gene Technology.
The part of an amino acid within a protein that is covalently bonded to the alpha carbon atom and, therefore, not part of the continuous main chain of the protein. The genetic code can code for 20 amino acids with different side chains; some other variants can be created by posttranslational modification. The chemical nature of amino acid side chains determines the biochemical properties of the amino acids and, thence, the function of the proteins that they are built from.
A sequence of generally 15-30 mainly hydrophobic residues at the N-terminal end of a protein. Its function is to target the protein to, and then across, the cell membrane. The protein will then be cleaved at the end of the signal peptide, releasing the mature protein from the cell. There are often positive charges at the far N-terminus of the sequence and negative ones just C-terminal of the cleavage site. Many reliable programs for the recognition of signal peptides and consequent prediction of cellular location are available.
A general term for a program for searching a DNA or protein sequence database for sequences that are similar to a test sequence, such as BLAST or FastA. Similarity searches are often termed homology searches, but this is a misnomer: the programs do not explicitly determine homology and this must be inferred by the expert user. There is always a “grey area” where sequence similarity may or may not be statistically significant.
A technique used in simulation of macromolecular (or any molecular) structure in which the molecule is “heated up” (i.e., its kinetic energy is increased) during a molecular dynamics simulation. The increase in kinetic energy allows the structure to cross energy barriers. The temperature is then reduced to a more physiologically appropriate one. Repeated simulated annealing experiments allow the molecule to sample more of the available conformational space, increasing the chances of approaching the global energy minimum for the structure.
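The heat-then-cool idea can be sketched on a toy problem. Note the substitutions: the sketch uses a Metropolis Monte Carlo walk rather than a molecular dynamics simulation, and the one-dimensional energy function, cooling schedule and step size are all hypothetical choices made for illustration.

```python
import math
import random

def energy(x):
    """Hypothetical 1-D energy surface with a local minimum (x > 0) and a global minimum (x < 0)."""
    return x**4 - 3 * x**2 + x

def simulated_annealing(start, t_start=5.0, t_end=0.01, cooling=0.95,
                        steps_per_temp=50, step_size=0.5, seed=0):
    """Return (best_x, best_energy) found while the temperature is lowered from t_start to t_end."""
    rng = random.Random(seed)
    x, e = start, energy(start)
    best_x, best_e = x, e
    t = t_start
    while t > t_end:
        for _ in range(steps_per_temp):
            candidate = x + rng.uniform(-step_size, step_size)
            e_new = energy(candidate)
            # Metropolis criterion: always accept downhill moves; accept uphill
            # moves with probability exp(-dE / T), so barriers are crossed more
            # easily while the temperature is high.
            if e_new <= e or rng.random() < math.exp(-(e_new - e) / t):
                x, e = candidate, e_new
                if e < best_e:
                    best_x, best_e = x, e
        t *= cooling  # gradual cooling
    return best_x, best_e

# Start in the basin of the local minimum and let the walk explore.
best_x, best_e = simulated_annealing(start=1.1)
print(best_x, best_e)
```

Running the function with several different seeds corresponds to the repeated annealing experiments described above, each sampling a different path through the available space.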
Single Nucleotide Polymorphism
Strictly, a type of polymorphism in which there is a base change at a single position only (e.g., a single A is changed into T, C, or G with the surrounding sequence left unchanged). SNPs occur in coding and noncoding regions, and coding SNPs may be silent (i.e., the codon change does not affect the coded amino acid). Sometimes, small insertions and deletions are included in the same category as SNPs. There are estimated to be 3-30 million SNPs in the human genome.
A technique for introducing single amino acid changes into a protein by making specific changes to single base pairs at specific sites in a target DNA. Often used to probe the effects of small changes on the structure or function of a protein.
A dynamic programming method of aligning pairs of sequences that was adapted from the Needleman-Wunsch method to produce local alignments rather than alignments between the whole sequences. The output will be one or more high scoring sequence segments, with or without internal gaps; gaps at the end of sequences will be removed. The order of matching regions may differ between the sequences. Local alignments and methods that depend on them are often used to identify conserved domains. This method is used in, for example, the EMBOSS local alignment program, WATER.
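A minimal sketch of the local alignment recurrence is given below. It assumes a simple match/mismatch score and a linear gap penalty, with values chosen arbitrarily for the example; production programs such as WATER use substitution matrices and affine gap penalties, which are not reproduced here.

```python
def smith_waterman(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned segment of seq_a, aligned segment of seq_b)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; the floor of zero is what makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + s,  # align two residues
                              score[i - 1][j] + gap,    # gap in seq_b
                              score[i][j - 1] + gap)    # gap in seq_a
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(seq_a[i - 1]); out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(seq_b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

# Two invented sequences that share only the segment ACGT.
print(smith_waterman("TTTACGTCCC", "GGACGTGG"))
```

Only the shared segment is reported; the unrelated flanking residues are left out of the alignment, which is the behaviour that distinguishes this method from a global alignment.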
A class of small RNA molecules that are involved in chemical modifications of other RNA genes. They are a component of the small nucleolar ribonucleoprotein complex (snoRNP), which also contains protein. The snoRNA guides the snoRNP complex to the modification site of the target RNA gene via snoRNA sequences that hybridize to the target site.
An abundant class of relatively small, uridine rich RNA molecules, 100-300 nucleotides in length, which are associated with small nuclear ribonucleoprotein particles. These are found in the nucleus and needed for RNA splicing. The U-RNA family is a strongly conserved family of snRNAs, and its members are designated U1-U10.
Solvent Accessible Surface
In computer-based molecular modeling, a surface around a molecule that is created by virtually rolling a “probe”, usually the size of a water molecule, around the molecule in direct contact with it and plotting the trajectory of the center of the probe. It is a larger and “smoother” surface than that built using atomic van der Waals radii and includes those parts of the molecule that are accessible to solvent.
Any eukaryotic cell other than a germ cell – that is, any cell that is normally diploid. The term somatic gene therapy is used for any process by which the genomes of somatic cells are artificially altered. This is safer than altering germ cells, and raises fewer ethical questions, as the germline of the individual is not altered.
A well-known method for the detection of specific DNA fragments. The DNA fragments are first separated using agarose gel electrophoresis; then the separated fragments are blotted onto nitrocellulose paper. Labeled cDNA probes are hybridised onto the separated bands, which can then be viewed using autoradiography. The similar technique of Northern blotting is used to detect sequences in RNA.
Space Charge Effect
An effect that limits the current in a beam of ions of like charge, such as that in a mass spectrometer, arising from mutual repulsion between the ions. Space charge is a consequence of Coulomb’s Law, which states the repulsion between particles of like electrostatic charge. In practice, the space charge effect leads to an expansion in radius of the charged ion beam.
The evolutionary process by which a branch of the Tree of Life bifurcates to form two distinct species. A species is defined as a population of individual organisms that share extremely similar phenotypes and genetic makeup. Where sexual reproduction occurs, male and female individuals of the same species must be able to produce fertile offspring. In prokaryotes, the basic definition of speciation is confused by horizontal gene transfer.
Many procedures in sequence analysis (e.g., hydropathy profiles, dotplots) involve defining a stretch of contiguous residues, calculating a given property and then recalculating the same property for each stretch of residues along the entire sequence. This is known as defining a “window” that is “slid” along the sequence.
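The sliding-window procedure can be sketched for a hydropathy profile. The sketch below uses the Kyte-Doolittle hydropathy scale; the window length and the example sequence are arbitrary choices for illustration.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def sliding_window_profile(sequence, window=5, scale=KYTE_DOOLITTLE):
    """Average the per-residue property over every window of contiguous residues."""
    profile = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        profile.append(sum(scale[res] for res in segment) / window)
    return profile

# Invented sequence: a hydrophobic stretch followed by a charged one.
print(sliding_window_profile("IIIKKK", window=3))
```

Each value in the result corresponds to one position of the window, so a sequence of length n and a window of length w give n - w + 1 values; the profile falls steadily as the window slides from the isoleucines into the lysines.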
Splice Acceptor Element
A segment of DNA that, if included in a vector upstream of a gene sequence, will be read as a splice acceptor signal (intron-exon boundary) and thus enable that gene to be transcribed if it is inserted within an intron of a transcribed gene. Splice acceptor and donor elements are used, for example, in the technique of gene trap mutagenesis, in which mutated genes are inserted into the mouse germline.
Splice Donor Element
A segment of DNA that, if included in a vector downstream of a gene sequence, will be read as a splice donor signal (exon-intron boundary). If a splice donor site is incorporated without a poly-A sequence the gene will only be transcribed if a poly-A signal can be obtained from the endogenous gene. Splice acceptor and donor elements are used, for example, in the technique of gene trap mutagenesis, in which mutated genes are inserted into the mouse germline.
Splice Site
In eukaryotic genes containing introns, the junction between an intron and the following exon at the 3′ end of the intron (splice acceptor site) and the junction between an exon and the following intron at the 5′ end of the intron (splice donor site). Both types of splice sites may be identified from consensus DNA sequences.
A method of gene finding from a DNA sequence and a set of candidate predicted exons, by searching the set of possible exon chains for the one with the best fit to a related protein sequence. The original exon set is constructed by considering all possible donor and acceptor splice sites in the genomic sequence. Although this gives an enormous number of candidate exons, most of which will be false positives, the method can be very fast.
If certain gene or protein sequence patterns are more frequently found in a longer sequence than they would statistically be expected to be, they are said to be statistically overrepresented in the longer sequence. The opposite phenomenon is termed statistical under-representation. Sequences that are overrepresented may be functionally important whereas underrepresentation indicates that that particular (possibly functional) motif may have deleterious consequences to the organism.
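One simple way to make "more frequently than expected" concrete is to compare the observed count of a motif with the count expected if each position were drawn independently from the base composition of the sequence. This zero-order background model is an assumption made for the sketch; real analyses generally use richer background models and a formal significance test.

```python
from collections import Counter

def observed_count(sequence, motif):
    """Count occurrences of the motif, allowing overlaps."""
    k = len(motif)
    return sum(1 for i in range(len(sequence) - k + 1) if sequence[i:i + k] == motif)

def expected_count(sequence, motif):
    """Expected occurrences if positions were independent draws from the base composition."""
    freqs = Counter(sequence)
    n = len(sequence)
    prob = 1.0
    for base in motif:
        prob *= freqs[base] / n
    return (n - len(motif) + 1) * prob

def representation_ratio(sequence, motif):
    """Ratio above 1 suggests over-representation; below 1, under-representation."""
    expected = expected_count(sequence, motif)
    return observed_count(sequence, motif) / expected if expected > 0 else float("nan")

# Invented sequence in which the dinucleotide AT is clearly enriched.
seq = "ATATATATATGC"
print(observed_count(seq, "AT"), expected_count(seq, "AT"), representation_ratio(seq, "AT"))
```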
A type of cell that has the potential to differentiate into any one of many different types of cell. Stem cells potentially have very important applications in therapy, particularly for degenerative diseases. The most effective stem cells are embryonic stem cells, derived from early embryos, but the use of these is extremely controversial. Stem cells may also be derived from the umbilical cord or produced from certain types of ordinary cell using chemicals.
In general terms, the extent to which errors or mismatches are tolerated in a detection experiment or a bioinformatics calculation. Thus, in sequence analysis, stringency is used to set the number of mismatches allowable in a sequence segment defined as a hit (e.g., one that generates a dot in a dotplot). Similarly, a PCR reaction that is relatively tolerant of errors in the resulting sequences is defined as a low stringency reaction.
Any experimental program for solving protein structures by X-ray crystallography or nuclear magnetic resonance that may be described as high throughput, aiming to solve a large number of different structures in a short time, may be termed a structural proteomics (or, confusingly, structural genomics) program. There are two main approaches: either experimentalists concentrate on solving proteins from a particular bacterium or involved in a particular disease, or they attempt to increase the coverage of sequence-structure space by picking proteins predicted to have unknown folds.
A method of reducing the chance of finding spurious associations between genes and disease that are caused by population heterogeneity in the large populations necessary for the study of the genetics of complex diseases. In structured association methods, the details of the population substructure are inferred during the early stages of association testing, so they can be taken into account during the rest of the analysis.
In simple terms, a part of a cell, such as the nucleus, the cytoplasm, or the cell membrane. Bioinformatics tools may be used to predict, with reasonable accuracy, the subcellular compartment or compartments in which a protein is most likely to be found: to take a simple example, proteins containing hydrophobic segments of a certain length are likely to be found embedded in the cell membrane.
Graph theory is used in bioinformatics, for example, to analyze the expression patterns of a group of genes. In graph theory, “A subgraph of the graph G is defined as a graph whose vertex set is a subset of the vertex set of G, whose edge set is a subset of the edge set of G, and such that the map w is the restriction of the map from G.”
The genetic heterogeneity in large populations that causes problems such as spurious association (false positives) in population-based studies of genetics of complex diseases is generally known as population substructure or stratification.
Phylogenetic trees of prokaryotic species based on single genes are limited by, among other factors, the amount of horizontal gene transfer between species. It is considered that phylogenetic trees derived from many genes or even whole genomes may reconstruct the prokaryotic “tree of life” more accurately. The supertree approach to combining phylogenetic information combines single-gene trees rather than sequence alignments. It can be used to combine trees that share only a few species, and it can be used where whole-genome sequences are not available for all species.
Regions of genomes (generally from quite closely related species) that share at least gene content and often gene order are said to exhibit synteny. Often gene order will be partly disrupted by gene loss, inversion and duplication events. In eukaryotic genomes, large chromosomal regions are conserved throughout chromosomal rearrangements, so regions of one chromosome will be syntenous with parts of different chromosomes from related species.
Any one of a number of disparate techniques to study, and specifically to model using mathematics and engineering techniques, the various components of a biological system as an integrated system, for example, modeling a cell using its component molecules, or an organ or tissue using its component cells. It may also be thought of as a mathematical way of thinking of physiology.
Single nucleotide polymorphisms (SNPs) that are known to be associated with haplotype blocks (DNA segments located between recombination hot spots that are usually inherited as blocks). It is possible to genotype individuals for susceptibility to complex diseases using fewer SNPs in total if only the known tag SNPs are used.
Tandem Mass Spectrometry
A method for separating and identifying proteins using mass spectrometry (MS) alone, without an initial electrophoresis step. It uses two mass spectrometry steps, hence the term “tandem MS”. The first MS step separates a single protein ion from a mixture. The second step fragments the protein into a series of peptides and analysis of the fragmentation pattern gives rise to short sequence fragments that can be used to identify the protein. Often, electrospray is used for the first ionisation: this technique is known as ESI-MS-MS.
In gene sequence analysis, the arrangement of two or more identical sequences within a DNA molecule so that they are close neighbors. These can either be direct (head-to-tail) or inverted (head-to-head), in which case one of the sequences is reversed. The term may also be used to refer to two or more chromosomal segments that are arranged as close neighbors within the chromosome.
An enzyme (EC 2.7.7.7) from the thermophilic eubacterium Thermus aquaticus (strain YT 1 or BM), which polymerizes deoxynucleotides with little or no 3′-5′ or 5′-3′ exonuclease activity. It is thermostable (optimum temperature 70-75°C) and allows the selective amplification of any cloned DNA about 10 million-fold with high specificity and fidelity. It is also used to label DNA fragments with radioactive nucleotides, biotin or digoxigenin, and it can be used in Sanger sequencing.
An AT-rich DNA region with the consensus sequence TATAT/AAT/A (in plants, the consensus is TATAATA) most frequently located a few tens of base pairs upstream of the transcription initiation site of eukaryotic genes. It represents the transcription factor binding site; it is essential for accurate initiation of transcription, but not necessary for quantitative expression. It is not found in the promoters of most constitutively expressed (“housekeeping”) genes. Source: Kahl, G, The Dictionary of Gene Technology.
In phylogenetic analysis, each individual sequence (or species) represented in a phylogenetic tree is described as a taxon (plural: taxa). Each taxon is a phylogenetically distinct unit and appears on the tree as a point at the top level of the tree (so following the analogy further, the taxa are the leaves).
The ends of eukaryotic chromosomes are known as telomeres. They are usually gene poor, and the telomere tips contain highly repetitive DNA sequences. Telomeres preserve the integrity of chromosomes during replication, and their length tends to decrease as the age of the organism increases. Telomere elongation is catalyzed by the enzyme telomerase.
The final phase of the cell cycle, during which nuclear membranes reform around each collection of daughter chromosomes and the cell divides in two.
In mitosis, the result is the division of the original cell into two identical daughter cells. Meiosis involves two cell divisions, resulting, after the second telophase, in four haploid gametes, each containing a single copy of each chromosome.
Tentative Consensus Sequence
A consensus sequence of amino acids or (more often) of nucleotides which is inexact, that is, where not every position can be completely characterized. Tentative consensus sequences may, nevertheless, be used to search databases, using programs such as Scansite. Tentative consensus sequences are often used to characterise gene promoter regions.
The structure or fold of a single protein (or polypeptide) chain and, originally, the third level in the hierarchy of protein structure, between secondary and quaternary structure. Now it is widely known that protein chains may fold into one or many domains and that each domain may be assigned to a different fold category. The terms “supersecondary structure” and “domain” have now been added to the structural hierarchy between the secondary and tertiary levels.
A cell is defined as tetraploid if it contains four copies of each chromosome – that is, twice the genetic content of a normal diploid cell, or four times that of a haploid gamete. Mosaic tetraploidy (where only some cells are tetraploid) is quite common in preimplantation diagnosis but very rare in implanted embryos and fetuses. It is not clear whether this is because the condition is embryonic lethal or whether it is, in fact, harmless due to selective growth of normal cells. Complete tetraploidy is known to be embryonic lethal.
The presence of four chromosomes or part-chromosomes of the same type instead of two in a diploid genome. It arises following errors of segregation during meiosis. In humans, a full tetrasomy of a whole chromosome would be incompatible with life, but mosaic tetrasomies of part chromosomes (i.e., where the aberration occurs in some cells only) occasionally occur. A mosaic tetrasomy of chromosome 12p has been associated with profound mental retardation.
A method for predicting the structure of a protein from its sequence in cases where no obviously homologous proteins of known structure are available. The test sequence is “threaded” through a variety of protein fold templates and the sequence-structure match evaluated using, for example, an energy function. David Jones’ THREADER is a good example of a public domain threading program.
In any analysis where data is to be classified into two (or more) groups, but where the programs used produce numerical scores, the threshold is the score that marks the boundary between two groups. The threshold is normally set by the user and its value determines the number of false negatives and false positives that the experiment will produce (if the threshold is set too low there will be many false positives; if too high, there will be many false negatives).
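The trade-off between false positives and false negatives can be seen by counting outcomes at two different thresholds. The scores and true labels below are invented for illustration, and treating a score equal to the threshold as positive is an arbitrary convention.

```python
def count_outcomes(scored_samples, threshold):
    """Classify each (score, truly_positive) pair and count TP, FP, TN and FN.

    Scores at or above the threshold are called positive.
    """
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for score, truly_positive in scored_samples:
        called_positive = score >= threshold
        if called_positive and truly_positive:
            counts["TP"] += 1
        elif called_positive and not truly_positive:
            counts["FP"] += 1
        elif not called_positive and truly_positive:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts

# Hypothetical scores paired with whether the sample truly belongs to the positive group.
samples = [(0.95, True), (0.80, True), (0.60, False),
           (0.55, True), (0.30, False), (0.10, False)]

print(count_outcomes(samples, threshold=0.2))  # low threshold: more false positives
print(count_outcomes(samples, threshold=0.7))  # high threshold: more false negatives
```

Lowering the threshold to 0.2 recovers every true positive at the cost of two false positives, whereas raising it to 0.7 removes the false positives but misses one true positive.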
Time of Flight
The most widely used type of mass analyzer in mass spectrometry, at least as that is applied to the identification of separated proteins. The peptide ions are accelerated so ions of like charge have the same kinetic energy; therefore, from basic physical principles, the velocity of an ion is inversely related to the square root of its mass/charge ratio, and the time taken for the ion to travel to the detector increases with that ratio. This enables that mass/charge ratio to be determined as a step toward peptide and protein identification.
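The underlying physics can be sketched numerically. An ion of mass m and charge ze accelerated through a potential U gains kinetic energy zeU = ½mv², so over a field-free drift length L its flight time is t = L·sqrt(m / (2zeU)). The accelerating voltage and drift length used below are assumed round numbers, not the settings of any particular instrument.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs
DALTON = 1.66053906660e-27           # kilograms per unified atomic mass unit

def flight_time(mass_da, charge, voltage, drift_length):
    """Idealized time of flight in seconds.

    mass_da: ion mass in daltons; charge: number of elementary charges;
    voltage: accelerating potential in volts; drift_length: path length in metres.
    """
    mass_kg = mass_da * DALTON
    kinetic_energy = charge * ELEMENTARY_CHARGE * voltage  # equals (1/2) m v^2
    velocity = math.sqrt(2 * kinetic_energy / mass_kg)
    return drift_length / velocity

# Assumed instrument settings for illustration: 20 kV acceleration, 1 m drift tube.
t_light = flight_time(1000, 1, 20000, 1.0)
t_heavy = flight_time(4000, 1, 20000, 1.0)
print(t_light, t_heavy, t_heavy / t_light)
```

Quadrupling the mass/charge ratio doubles the flight time, which is why a measured arrival time can be converted back into a mass/charge value.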
The arrangement and linking of a group of elements; the properties of a figure that are unchanged by continuous distortion (strict mathematical definition). In protein structure, the topology of a protein describes its overall shape and the connectivity between the elements; it is the term that is used to describe the third (Fold Family) level of the CATH protein structure classification. The term may also be used to describe the orientation of a transmembrane helix bundle in the membrane.
The interaction between genomics and toxicology, or the influence that genetic variation has on drug toxicity. Common genetic variations (SNPs) may lead to drugs causing toxic side effects in some people. For example, drugs may accumulate to toxic levels in people with less efficient variants of enzymes in the cytochrome p450 family of metabolic enzymes.
It often happens that transcription factor-DNA complexes must be stabilized by the binding of other proteins before mRNA transcription can take place. This stimulation of transcription by a transcription factor and its associated adjacent proteins binding to a promoter region is termed transactivation.
A technique of measuring and identifying the mRNA content of a cell by harvesting (or capturing) the transcripts present in that cell before they can be degraded. This method can be used to monitor gene expression within and between cell types and to identify splice variants, including those resulting from exon-skipping events.
A protein that binds to the recognition sequence of a DNA molecule, upstream of a coding sequence, and facilitates transcription initiation. DNA-dependent RNA polymerases bind to the transcription factor-DNA complex that activates RNA polymerization. Transcription factors may also bind to upstream regulatory sequences or even to sequences within the coding regions.
The use of microarray or similar technologies to determine a profile of the mRNA molecules (transcripts) present in a cell type under particular conditions and at a particular time. Very briefly, short pieces of cDNA molecules are immobilized on a grid, mRNA from the cell type under study is tagged with fluorescent probes and hybridized to the stationary cDNAs. Fluorescence at a spot indicates the presence of an mRNA that is complementary to that cDNA molecule.
The complete DNA sequence between the transcription initiation site and the transcription termination site, both sites as recognized by the DNA-dependent RNA polymerase. A transcripton may contain one gene or more than one; in the latter case the message produced is polycistronic, but only in prokaryotes is this ever translated into a single polyprotein.
By analogy with “genome”, “proteome”, and a large number of other “omes”, the set of mRNA transcripts that is present in a cell. Unlike the genome, but like the proteome, an individual organism’s transcriptome is not constant but varies according to cell type, developmental stage and conditions (e.g., a disease state or the presence of a drug). However, the correlation between the transcriptome and the proteome is not particularly strong and the proteome is, self-evidently, more indicative of the metabolic processes that are taking place in the cell.
The transmission of a signal from the exterior surface of a cell or organelle into the interior of that system, leading to an internal response to the external signal. Signal transduction is initiated by a ligand binding to a surface receptor and carried out by a cascade of enzyme activity.
The uptake of viral nucleic acid by bacterial cells or spheroplasts, resulting in the production of a complete virus. Alternatively, the integration of foreign DNA into the genome of cultured animal or plant cells via direct gene transfer.
Any gene that has been transferred from one organism to another organism of a different species. The transformed organism is known as a transgenic organism. Transgenes may not be expressed, or may be expressed at very low levels, in the host organism. Transgenic modification must be strictly controlled by law.
The process of protein synthesis at the ribosome is termed the translation of the RNA sequence into protein. The sequence of the resulting polypeptide is determined from that of the original RNA molecule via the genetic code. Although one code is used almost universally, alternate codes are used in mitochondria and some groups of organisms. The process of protein translation is a complex one in which the ribosome operates as a molecular machine.
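The mapping from RNA to polypeptide via the genetic code can be sketched as a table lookup. The sketch uses the standard genetic code only, reads from the first base, and stops at the first stop codon; start-codon scanning, alternate codes and everything about the ribosome itself are deliberately left out.

```python
# Standard genetic code, built from the conventional U, C, A, G ordering of bases.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}

def translate(rna):
    """Translate an RNA string codon by codon, stopping at the first stop codon ('*')."""
    protein = []
    for start in range(0, len(rna) - 2, 3):
        residue = CODON_TABLE[rna[start:start + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)

# Invented message: AUG GCC UUU UAA
print(translate("AUGGCCUUUUAA"))
```

Supporting an alternate code, such as a mitochondrial one, would only require swapping in a different amino acid string for the table.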
The stepwise, codon-to-codon advance of a ribosome along a messenger RNA sequence with simultaneous transfer of the peptidyl-tRNA from the A site to the P site of the ribosome. Each step exposes an mRNA codon for base pairing with its specific tRNA anticodon. Alternatively, any change in the position of a specific chromosome segment either within the chromosome (“shift”) or to another nonhomologous chromosome (interchromosomal translocation). Source: Kahl, G, The Dictionary of Gene Technology.
A test of the role of genetic factors in disease states in which the genotypes of cases of a disease are compared to those of their parents to discover whether a genetic variant or marker is inherited by cases at frequencies higher than would be expected using classical Mendelian genetics. If the allele or marker is, in fact, transmitted in excess of what would be expected in cases of disease, it indicates that the allele is a risk factor for the disease.
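As a rough illustration of the comparison described above, the test is commonly computed as a McNemar-type chi-square statistic on transmissions from heterozygous parents: if b parents transmit the candidate allele to an affected child and c transmit the other allele, the statistic is (b - c)^2 / (b + c), with one degree of freedom. The sketch below assumes that form; the function name `tdt` and the example counts are hypothetical.

```python
import math


def tdt(transmitted: int, untransmitted: int) -> tuple[float, float]:
    """Return (chi-square statistic, p-value) for counts of heterozygous
    parents who did / did not transmit the candidate allele to an affected
    child. Under Mendelian inheritance both counts are expected to be equal."""
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("need at least one informative transmission")
    chi2 = (transmitted - untransmitted) ** 2 / total
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Illustrative counts only: 60 transmissions against 40 non-transmissions.
chi2, p = tdt(60, 40)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

A small p-value here would suggest that the allele is transmitted to cases more (or less) often than the 50% expected by chance; the sketch does not correct for multiple testing.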
Generally, any sequence or segment of DNA that can change its location within a genome. However, in the strictest usage the term “transposon” is restricted to use in prokaryotes, with similar sequences in eukaryotes being termed “transposon-like elements”. A transposon is flanked by short inverted repeat sequences; it encodes an enzyme that catalyzes its excision from its first site and insertion in a new site. Transposons can be used in the construction of certain types of cloning vectors.
The ligation of exons from two different mRNA molecules to form one messenger RNA with a different combination of coding sequences that will therefore be translated into a different protein. Much of the complexity of vertebrate proteomes arises from the formation of multiple proteins from a simple gene set using mechanisms such as this one. Source: Kahl, G, The Dictionary of Gene Technology.
Trinucleotide Repeat Expansion
A sequence of three bases that is repeated a large and variable number of times at a specific position of a chromosome (and thus, a special case of microsatellite). The repeated sequence may occur in coding or noncoding DNA; where it occurs in coding DNA, it gives rise to an amino acid repeat. Several rare single gene disorders arise from an expansion in a trinucleotide repeat. The best known of these is Huntington’s disease, where the expansion is of the trinucleotide CAG in a coding region and therefore of the amino acid glutamine.
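A simple way to see how a repeat length might be measured in a sequence is to find the longest uninterrupted run of the repeat unit. The following is a minimal sketch under that assumption; it ignores interrupted repeats and sequencing errors, and the function name `longest_repeat_run` and the test sequence are hypothetical.

```python
import re


def longest_repeat_run(seq: str, unit: str = "CAG") -> int:
    """Return the number of copies in the longest uninterrupted tandem run
    of `unit` found anywhere in `seq` (0 if the unit never occurs)."""
    runs = re.findall(f"(?:{re.escape(unit)})+", seq.upper())
    return max((len(run) // len(unit) for run in runs), default=0)


# Made-up sequence containing a run of three CAGs and a run of two.
print(longest_repeat_run("ATGCAGCAGCAGTTCAGCAG"))
```

Comparing this count between individuals, or against a reference, is one way an expansion could be flagged; any clinical threshold would have to come from the literature for the specific disorder.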
The presence of three chromosomes of the same type instead of two in a diploid genome. It arises when one chromosome fails to segregate during meiosis. In humans, most trisomies are incompatible with life, but people with trisomy 21 (three copies of chromosome 21) can lead fairly satisfying lives, albeit with the mental and physical disabilities characteristic of Down’s syndrome. Babies born with some other trisomies, including trisomy 13, may live a few months.
Tropism
Tropism, in general, is the involuntary response of an organism to a stimulus. Viral tropism is the interaction between the virus and its host, and it can hamper gene therapy with viral vectors. Scientists are developing methods of modifying and decreasing tropism in viral vectors.
Tumour Suppressor Gene
Genes encoding signal transduction proteins whose signals inhibit cell growth and division are known as tumor suppressor genes; in this sense they are the opposite of oncogenes. When tumor suppressor genes are mutated they may lose their functionality, leading to a loss of control of cell proliferation and, potentially, the development of cancer.
The accuracy of a model of the three-dimensional structure of a protein built from the structure of a similar sequence will depend on the percentage identity between the sequences. Between about 10% and 25-30% sequence identity, the degree of homology (evolutionary relationship) between the sequences cannot be inferred from the identity value alone, although a clear evolutionary relationship may be deduced from other means. In this case, the proteins are said to fall within the twilight zone, and successful homology modeling may not be possible.
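To make the identity ranges above concrete, the sketch below computes percentage identity from a pair of already-aligned sequences and checks whether it falls in the quoted range. Several conventions exist for the denominator; this example assumes identity is counted over columns where neither sequence has a gap. The bounds of 10% and 30% are taken from the approximate figures in the definition, and the function names are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over alignment columns in which
    neither sequence has a gap (one of several possible conventions)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def in_twilight_zone(identity: float, lower: float = 10.0,
                     upper: float = 30.0) -> bool:
    """True if the identity lies in the approximate range where homology
    cannot be inferred from the identity value alone."""
    return lower <= identity <= upper


identity = percent_identity("ACDEFGHIKL", "ACQWYTRPMN")
print(identity, in_twilight_zone(identity))
```

A result inside the range does not rule out homology; it only signals that other evidence is needed.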
If a pre-mRNA transcript is processed by fewer splicing events than would be required to produce the correct, mature mRNAs, that transcript is described as underspliced. This frequently occurs in viral processing; for example, HIV originally produces a single transcript that is processed into some 30 mRNAs. If the transcript is underspliced, fewer mature mRNAs will be produced and the daughter virions will not be able to assemble.
A genetic condition in which two copies of one or more chromosomes are inherited from one parent and none from the other parent, with the chromosome number remaining normal. The condition may be termed maternal or paternal uniparental disomy depending on the parent that provided both chromosomes. It may be silent (with the child appearing phenotypically normal); alternatively, it may result in developmental defects due to abnormal imprinting.
Toward the 5′ end of a DNA sequence. The term is most often used for sequence that is located 5′ of the coding sequence of a gene. The 5′-most region of that sequence that is transcribed into the pre-mRNA (but not translated into protein) is known as the 5′-untranslated region (5′-UTR). Control sequences that are bound by transcription factors and that are located upstream of the 5′-UTR are never transcribed into mRNA.
Portions of sequence that are transcribed into RNA, but not translated into protein. Each transcribed gene includes a sequence upstream of the start codon (leader sequence, 5′ untranslated region or 5′-UTR) and one downstream of the stop codon (trailer sequence, 3′ untranslated region or 3′-UTR). Untranslated regions contain sequences that control expression. The poly-A tail is not part of the trailer sequence as it is not part of the original gene: it is added to the 3′ end of the trailer sequence after transcription.
A plasmid or phage cloning vehicle specially constructed to achieve efficient transcription of a cloned DNA fragment and translation of its mRNA into protein. Cloning vectors often contain an expression cassette including a highly active promoter, to aid efficient gene expression. Genome sequences may be contaminated by vector sequence; this may be tested for by comparing the new genome sequence with a database of known vector sequences. Source: Kahl, G, The Dictionary of Gene Technology.
A gene or gene cluster in a microbial pathogen which increases the virulence of that pathogen to its human or animal host. Many virulence factors have been transferred between bacterial species via plasmids. Some very pathogenic bacteria, such as Vibrio cholerae (the causative agent of cholera) have been shown to contain “systems” of toxicity comprising 40 or more protein toxins and virulence factors.
A statistical model used in sequence analysis in which each position in the sequence is modeled independently of the others. Typically, a score is allocated to each amino acid (or base) based on the likelihood of that amino acid (or base) being found at that position in the feature under consideration.
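The position-independent scoring described above can be sketched as a log-odds matrix built from a set of aligned example sites. The example below is one common construction, not the only one: it assumes a uniform background distribution and a pseudocount of 1 per symbol, and the names `build_pwm` and `score` as well as the example sites are hypothetical.

```python
import math
from collections import Counter


def build_pwm(sites: list[str], alphabet: str = "ACGT",
              pseudocount: float = 1.0) -> list[dict[str, float]]:
    """Build a log2-odds weight matrix (one dict per position) from aligned,
    equal-length sites, assuming a uniform background distribution."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    background = 1.0 / len(alphabet)
    total = len(sites) + pseudocount * len(alphabet)
    pwm = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        pwm.append({
            ch: math.log2(((counts[ch] + pseudocount) / total) / background)
            for ch in alphabet
        })
    return pwm


def score(pwm: list[dict[str, float]], window: str) -> float:
    """Sum the per-position scores; positions are treated as independent."""
    if len(window) != len(pwm):
        raise ValueError("window length must match the matrix length")
    return sum(column[ch] for column, ch in zip(pwm, window))


pwm = build_pwm(["TATA", "TATA", "TACA", "TATA"])
print(score(pwm, "TATA"), score(pwm, "GCGC"))
```

Because each position contributes its own term to the sum, the model cannot capture dependencies between positions, which is the limitation implied by the definition.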
A technique, analogous to Southern blotting, for the detection of specific proteins that have been separated by 2D gel electrophoresis or a similar technique. The proteins are transferred to a membrane and visualized using specific radioactively labeled, fluorescence-conjugated or enzyme-conjugated antibodies. Source: Kahl, G, The Dictionary of Gene Technology.
In general terms, workflow is simply how a work procedure is organized. It is used in bioinformatics largely in applications that consist of a large number of relatively small or simple calculations, with later analyses chosen as a result of earlier ones. In these cases, workflow software may be used to automate at least some of the tasks and decisions. Examples might include the prediction of protein localization or immunogenicity from a sequence.
Yeast Two Hybrid
A relatively new, powerful technique for detecting interactions between proteins. It is based on the dual-module composition of yeast transcriptional activators such as GAL4. The protein under test is linked (hybridized) to the DNA binding domain of GAL4, and a library of proteins to the GAL4 activation domain. DNA transcription only occurs if there is an interaction between the test protein and one of the library proteins.
A statistical parameter defined as the difference between a score achieved for one variable in a set and the mean score for the whole set, divided by the standard deviation in the scores. A high z-score indicates that the score for a particular variable is an outlier that may be of statistical significance (e.g., indicating a match). This concept is often used in protein fold recognition, where it is used to calculate whether there is a statistically significant match between a sequence and one particular fold when it is compared to the whole database of possible folds.
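The definition above translates directly into code: subtract the mean of the set from each score and divide by the standard deviation. The sketch below assumes the population standard deviation (some applications use the sample standard deviation instead); the function name `z_scores` and the example values are hypothetical.

```python
import statistics


def z_scores(scores: list[float]) -> list[float]:
    """Return (score - mean) / standard deviation for every score in the set,
    using the population standard deviation of the whole set."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    if sd == 0:
        raise ValueError("z-scores are undefined when all scores are equal")
    return [(s - mean) / sd for s in scores]


# Made-up scores, e.g. one sequence scored against eight candidate folds.
print(z_scores([2, 4, 4, 4, 5, 5, 7, 9]))
```

In a fold recognition setting, the fold with an unusually high z-score relative to the rest of the database would be the candidate match.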