For example, in molecular biology the term “gene” is particularly vague. It may be
defined as an open reading frame (ORF), a protein-coding sequence (CDS), a unit of heredity, and so on.
In fact, in a large proportion of the technical literature, the term “gene” is used with-
out definition, leaving the interpretation up to the reader. If that reader is a human,
this approach is generally workable; a human will know whether the paper under
consideration deals with, for example, prokaryotes or eukaryotes, and will under-
stand the implications of this distinction for the application of the term. The fact that
prokaryotic genes frequently overlap, do not contain introns, may occur in operons,
and are not associated in linear chromosomes is part of the inferred knowledge of the
reader, and need not be stated. However, if the “reader” is a computational algorithm,
none of this knowledge can be assumed.
Structured vocabularies are a first step towards making text interpretable to com-
puters. A structured vocabulary may mandate, for example, that the term “CDS”
must always be used to mean “a sequence of nucleotides which codes for an mRNA
which codes for a protein, or part thereof”. As anyone who has ever used a Web form
with drop-down boxes knows, a structured vocabulary prevents confusion due to the
use of different terminology, or even the mis-typing of an agreed terminology.
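The drop-down analogy can be sketched in a few lines of Python. This is a minimal illustration only: the `VOCABULARY` table, the `validate_term` and `annotate` helpers, and the feature identifier are hypothetical names invented for the example, and the wording given for “ORF” is a placeholder rather than an agreed definition. Only the “CDS” definition is taken from the text above.

```python
# Hypothetical structured vocabulary: each permitted term maps to one agreed definition.
VOCABULARY = {
    "CDS": "a sequence of nucleotides which codes for an mRNA "
           "which codes for a protein, or part thereof",
    "ORF": "open reading frame (placeholder wording, not an agreed definition)",
}


def validate_term(term: str) -> str:
    """Return the term unchanged if it is in the vocabulary, else raise ValueError."""
    if term not in VOCABULARY:
        allowed = ", ".join(sorted(VOCABULARY))
        raise ValueError(f"{term!r} is not a permitted term; choose one of: {allowed}")
    return term


def annotate(feature_id: str, term: str) -> dict:
    """Attach a vocabulary term to a feature, rejecting free-text labels."""
    return {"feature": feature_id, "type": validate_term(term)}


if __name__ == "__main__":
    print(annotate("feature_001", "CDS"))
    for bad in ("gene", "cds"):  # an undefined term and a mis-typed one
        try:
            annotate("feature_002", bad)
        except ValueError as err:
            print("rejected:", err)
```

As with a drop-down box, both the undefined term “gene” and the mis-typed “cds” are rejected, so every stored annotation carries exactly one agreed meaning.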
Ontologies are far more, however, than just structured vocabularies. A structured
vocabulary ensures consistent naming of entities in a domain, but entities do not exist
in splendid isolation. Interactions between entities are the core of any complex system.
In an ontology, entities and the interactions between them are annotated, again using a
standard terminology (Figure 2.14). The presence of these annotations, with their strictly
defined meanings, means that computational algorithms can reason over a dataset represented
as a graph, constructed using an ontology, extracting relationships and generating
hypotheses which were not previously apparent. This approach is particularly valuable
for very large graphs, which are hard to display on a computer screen, and even harder for
a human to comprehend once they are displayed. Given the necessarily complex and
redundant language used in most biology papers, ontologies are clearly valuable for
literature mining in general, and for biology in particular (Jensen and Bork, 2010).
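As a rough sketch of what reasoning over such a graph can mean, the Python below stores a toy ontology as (subject, relation, object) triples and follows “is_a” edges to recover relationships that were never stated directly. The terms, the relation names, and the `ancestors` function are all assumptions made for illustration; they are not taken from any published ontology, and real reasoners support far richer logic than this transitive walk.

```python
from collections import defaultdict

# Hypothetical toy ontology: (subject, relation, object) triples.
TRIPLES = [
    ("CDS", "is_a", "gene_region"),
    ("gene_region", "is_a", "sequence_feature"),
    ("gene_region", "part_of", "gene"),
]


def ancestors(term, triples, relation="is_a"):
    """Return every term reachable from `term` by following `relation` edges."""
    # Index the graph: subject -> set of direct objects for the chosen relation.
    parents = defaultdict(set)
    for subject, rel, obj in triples:
        if rel == relation:
            parents[subject].add(obj)

    # Depth-first walk collecting direct and indirect parents.
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents[current]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


if __name__ == "__main__":
    print(sorted(ancestors("CDS", TRIPLES)))
```

No triple states that a CDS is a sequence feature, yet the walk infers it from the two “is_a” statements. The same kind of inference, applied to a graph far too large to inspect by eye, is what makes annotated relationships useful to an algorithm.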
Because of the value of ontologies to the representation and analysis of large
datasets, there are multiple community-based standards organisations which aim
to define agreed standards for various domains of biology. The umbrella organisation
for this effort is the OBO (Open Biological and Biomedical Ontologies) Foundry 5
(Smith et al., 2007). This organisation describes itself as “a collaborative experiment
involving developers of science-based ontologies who are establishing a set of prin-
ciples for ontology development with the goal of creating a suite of orthogonal inter-
operable reference ontologies in the biomedical domain”. Any researcher is welcome
to participate in ontology development for their specific area, and at the time of writ-
ing there are eight active domain-specific ontologies and 94 candidate ontologies,
ranging from broad domains such as “cell type” to highly specific areas such as “Dictyostelium
discoideum anatomy”. Perhaps the most widely known ontology amongst
5 http://obofoundry.org/