Biomedical Engineering Reference
Text Mining
For mankind to benefit from bioinformatics research, the sequence and structure of proteins and
other molecules must be linked to functional genomics and proteomics. The primary store of
functional data that links clinical medicine, pharmacology, sequence data, and structure data is in the
form of biomedical documents in online bibliographic databases such as PubMed (see Figure 7-14).
Mining these databases is expected to reveal the relationships between structure and function at the
molecular level and their relationship to pharmacology and clinical medicine.
Figure 7-14. PubMed Home Page. This source of biomedical literature
contains over 11 million citations and grows by about 3 percent
annually.
Text mining, the automatic extraction of these data from documents published as unstructured free
text, often in several languages, is a non-trivial task. Although computer languages such as LISP
(LISt Processing) have long been used for symbolic processing of free text, working with free text
remains one of the most challenging areas of computer science. This is
primarily because, unlike the analysis of the sequence of amino acids in a protein, natural language is
ambiguous and often references data not contained in the document under study. For example, a
research article on the expression of a particular gene in PubMed may contain numerous synonyms,
acronyms, and abbreviations. Furthermore, despite editing to constrain the sentences to proper
English (or another language), the syntax (the ordering of words and their relationships to other
elements in phrases and sentences) is typically author-specific. The article may also reference an
experimental method that isn't defined because it's assumed to be common knowledge among the
intended readership. In addition, text mining is complicated by variability in how data are
represented in a typical text document. Data on a particular topic may appear in the main body of
text, in a footnote, in a table, or embedded in a graphic illustration.
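The synonym problem described above can be made concrete with a small sketch. The Python example below maps several surface forms of a gene name to one canonical symbol. The alias table, the function names, and the sample sentence are all illustrative assumptions rather than part of any real text-mining system; a practical pipeline would load a curated nomenclature resource instead of a hand-written dictionary.

```python
import re

# Hypothetical, hand-written alias table used only for illustration.
# Keys are canonical gene symbols; values are surface forms seen in text.
SYNONYMS = {
    "TP53": ["TP53", "p53", "tumor protein p53", "LFS1"],
    "ERBB2": ["ERBB2", "HER2", "HER-2", "neu"],
}


def build_pattern(synonyms):
    """Compile one regex over all aliases and return it with a lookup table."""
    alias_to_symbol = {}
    for symbol, aliases in synonyms.items():
        for alias in aliases:
            alias_to_symbol[alias.lower()] = symbol
    # Try longer aliases first so "tumor protein p53" wins over plain "p53".
    ordered = sorted(alias_to_symbol, key=len, reverse=True)
    pattern = re.compile(
        # The lookarounds stop an alias from matching inside a longer token,
        # for example "neu" inside "neural".
        r"(?<![\w-])(" + "|".join(re.escape(a) for a in ordered) + r")(?![\w-])",
        re.IGNORECASE,
    )
    return pattern, alias_to_symbol


def find_gene_mentions(text, synonyms=SYNONYMS):
    """Return (surface form, canonical symbol) pairs in order of appearance."""
    pattern, alias_to_symbol = build_pattern(synonyms)
    return [
        (m.group(1), alias_to_symbol[m.group(1).lower()])
        for m in pattern.finditer(text)
    ]


if __name__ == "__main__":
    sentence = "Expression of tumor protein p53 and HER-2 was measured; p53 levels rose."
    for surface, symbol in find_gene_mentions(sentence):
        print(f"{surface!r} -> {symbol}")
```

Even this toy matcher shows why the task is hard: short aliases matched without regard to case can collide with ordinary words or unrelated acronyms, and no dictionary lookup can resolve a term whose meaning depends on knowledge outside the document.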
Natural Language Processing
 
 