Software Availability
JavaNNS: http://www.ra.cs.uni-tuebingen.de/software/JavaNNS/
5.6 Ontologies and text mining
Publicly available databases are an unparalleled resource for the microbiology
community. They have, however, a number of problems. As already mentioned,
much of the data in these databases are generated using high-throughput
technologies, and significant amounts of the data have not been subjected to human
curation. The data are noisy and incomplete, and often the proportions of false
positive and true positive identifications and interactions are unknown. Further,
the data are presented without context; we may know that a protein-protein
interaction has been identified using a yeast two-hybrid approach, but we generally
do not know what, if any, hypothesis the experiment which generated the data was
designed to investigate or, perhaps most importantly, how the originating
researcher interpreted the results. The most reliable data come, indisputably, from
the peer-reviewed literature.
Unfortunately, keeping up with the literature is impossible for a working
scientist. According to an editorial in Nature, compiled from user input acquired
using the social networking tool Twitter [4], we have sequenced approximately
1 × 10^-22 % of the DNA on earth: “the fraction of microbial diversity that we have
sampled to date is effectively zero” (Microbiology by Numbers [Editorial], 2011).
Even so, thousands of articles are published every month, and the rate of
publication is increasing exponentially (Hunter and Cohen, 2006).
The title “last man to know everything” has been variously applied to a number of
people, including Thomas Young (1773-1829) (Robinson, 2007), Athanasius
Kircher (1601 or 1602-1680) (Findlen, 2004), Joseph Leidy (1823-1891)
(Warren, 1998) and, of course, Gottfried Leibniz (1646-1716) (Fentress, 1914),
all of whom worked before the middle of the 19th century. “Knowing everything”
is no longer possible. Even in specialised sub-fields it is not feasible to scan all of
the relevant journals, identify papers which may be relevant to current or future
work, extract and understand the important findings, and organise the results in a
way which can be easily accessed at need. The concept of automated text mining
is therefore extremely attractive, and the application of text mining to genomics
has been an active area of research for at least 20 years (Zweigenbaum et al., 2007).
An important concept in literature mining, as in many other aspects of
bioinformatics (and, indeed, many other fields), is that of an ontology. An ontology
is a working conceptual model of the entities which exist in a given domain, and
their interactions (Gruber, 1993; Stevens et al., 2000). The basis of any ontology is
a structured vocabulary: a list of terms that must be used to describe the entities
in a domain.
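
The idea of a structured vocabulary can be illustrated with a minimal sketch in
Python. Everything in it is an assumption made for illustration: the class names
(Term, Ontology), the single "is_a" relation, and the four example terms are
invented here and are not taken from any real ontology or from the sources cited
above. The sketch shows only two properties: annotations are restricted to terms
in the vocabulary, and relations between terms can be followed to find broader
concepts.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Term:
    """One entry in the structured vocabulary (hypothetical structure)."""
    name: str
    parents: set[str] = field(default_factory=set)  # broader terms ("is_a")


class Ontology:
    """Toy ontology: a closed vocabulary plus a single 'is_a' relation."""

    def __init__(self) -> None:
        self.terms: dict[str, Term] = {}

    def add(self, name: str, is_a: tuple[str, ...] = ()) -> None:
        # Parents must already exist, so every relation points at a known term.
        for parent in is_a:
            if parent not in self.terms:
                raise KeyError(f"unknown parent term: {parent}")
        self.terms[name] = Term(name, set(is_a))

    def is_valid(self, annotation: str) -> bool:
        # Only terms from the vocabulary may be used to describe an entity.
        return annotation in self.terms

    def ancestors(self, name: str) -> set[str]:
        """Return every broader term reachable by following is_a links."""
        found: set[str] = set()
        stack = list(self.terms[name].parents)
        while stack:
            current = stack.pop()
            if current not in found:
                found.add(current)
                stack.extend(self.terms[current].parents)
        return found


# Illustrative terms only; a real ontology would be far larger and curated.
onto = Ontology()
onto.add("molecule")
onto.add("protein", is_a=("molecule",))
onto.add("enzyme", is_a=("protein",))
onto.add("kinase", is_a=("enzyme",))

print(onto.is_valid("kinase"))           # term is in the vocabulary
print(onto.is_valid("kinaze"))           # misspelt term is rejected
print(sorted(onto.ancestors("kinase")))  # broader concepts for "kinase"
```

Because free-text labels such as the misspelt "kinaze" are rejected, two
annotators describing the same entity are forced onto the same term, which is
what makes annotations comparable across databases and papers.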
[4] https://twitter.com/