cases information gleaned from articles is already known in some part of
an organization, but its relevance to some other new project is not
discovered until after the fact. For organizations such as pharmaceutical
companies, as the volume of information grows, the problem of 'knowing
what we already know' is set to become worse [6].
At the heart of this problem lies a simple truth. We are now completely
reliant on machines to help process the knowledge we generate; but
humans and machines speak radically different languages. To perform
efficiently, machines require formal, unambiguous and rigorous
descriptions of concepts, captured in machine-readable form. Humans,
on the other hand, need the freedom to write 'natural language', with all
its associated ambiguities, nuances and subtleties, in order to set down
and disseminate ideas. The trouble is, other than trained 'ontologists',
mathematicians, and logicians, very few scientists can turn complex
thoughts into the kind of formal, mathematical representations that can
be manipulated computationally; and very few ontologists,
mathematicians, and logicians understand biology and chemistry in
enough detail to capture leading-edge thinking in the pharmaceutical or
life sciences. For the time being, at least, we are faced with the challenge
of bridging this divide.
Text- and data-mining techniques have made some progress here,
attempting to extract meaningful assertions from written prose, and to
capture these in machine-readable form (e.g. [7]). But this is far from
being a solved problem: the vagaries of natural-language processing, coupled with the complex relationships between the terminology used in the life sciences and records in biomedical databases, make this a non-trivial task.
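
To give a flavour of what such extraction involves (the sketch below is ours, not the approach of [7]), the following Python fragment uses the open-source spaCy library to pull crude subject-verb-object triples out of a sentence; the function name and the example sentence are illustrative assumptions.

    import spacy

    # Assumes spaCy and its small English model are installed:
    #   pip install spacy && python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    def extract_triples(sentence):
        """Yield crude (subject, verb, object) triples via dependency parsing."""
        doc = nlp(sentence)
        for token in doc:
            if token.pos_ == "VERB":
                subjects = [w for w in token.lefts if w.dep_ in ("nsubj", "nsubjpass")]
                objects = [w for w in token.rights if w.dep_ in ("dobj", "attr", "pobj")]
                for subj in subjects:
                    for obj in objects:
                        yield (subj.text, token.lemma_, obj.text)

    print(list(extract_triples("Imatinib inhibits the BCR-ABL tyrosine kinase.")))
    # [('Imatinib', 'inhibit', 'kinase')]

Even on this toy sentence, the output discards the modifiers ('BCR-ABL tyrosine') that carry most of the biological meaning, which is precisely why the task remains non-trivial.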
In recent years, in what we consider to be a frustrated and frustrating
distraction from the real issues, many have pointed the finger of blame at the file format used by publishers to distribute scientific articles: Adobe's PDF. It has variously been described as 'an insult to science', as 'antithetical to the spirit of the web', and as being like 'building a telephone and then using it to transmit Morse Code', as though somehow the format itself is
responsible for preventing machines from accessing its content [8].
Although it is true that extracting text from PDFs is a slightly more unwieldy process than it is from some other formats, such as XHTML, to single out the PDF as the culprit is to miss the point entirely: the real problem arises from the gulf between human- and machine-readable language, not merely from the file format in which that natural language is stored. It is worth setting the record straight on two PDF-related facts.
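
Before turning to those facts, it is worth being concrete about the 'unwieldy' claim itself. The minimal sketch below, which assumes the third-party pypdf library and a hypothetical input file, shows that recovering raw characters from a PDF takes only a few lines; what is lost is the logical structure (headings, paragraphs, reading order) that a format such as XHTML encodes explicitly.

    from pypdf import PdfReader  # assumption: installed via 'pip install pypdf'

    reader = PdfReader("article.pdf")  # hypothetical file name
    raw_text = "\n".join(page.extract_text() or "" for page in reader.pages)

    # The characters come out readily enough; the semantics do not.
    print(raw_text[:500])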