Biomedical Engineering Reference
In-Depth Information
which create a local subset of PubMed data by capturing the native field definitions, such as author
name, publication title, and MESH keywords. However, these products don't support the automatic
integration of structure and sequence data with functional data. Their support for text mining of the
data within a document is limited to simple user-directed keyword search.
The most advanced NLP systems work at the semantic level—the analysis of how meaning is created
by the use and interrelationships of words, phrases, and sentences in a sentence. Unlike a typical
search engine, these advanced systems attempt to automatically populate a database with, for
example, functional genomic and proteomic data relevant to a specific gene, protein, or disease,
including rules and trends not explicitly stated or defined in the documents. These systems, which
represent the leading edge of NLP R&D, are less reliable than systems based on keyword extraction
and distribution techniques in that they sometimes formulate incorrect rules and trends, resulting in
erroneous search results.
Regardless of the level of NLP, most systems follow the basic process outlined in Figure 7-16 . Online
documents are first parsed into words, word collections, or sentences, depending on the NLP method
used. The simplest systems simply look at individual words, whereas systems that support mining of
document clusters focus on word collections to establish context. The most advanced NLP systems,
which attempt to extract meaning from words and word order, parse the documents at the sentence
level.
Figure 7-16. The NLP Process.
The processing phase of NLP involves one or more of a variety of the following techniques:
Stemming — Identifying the stem of each word. For example, "hybridized", "hybridizing",
and "hybridization" would be stemmed to "hybrid". As a result, the analysis phase of the NLP
process has to deal with only the stem of each word, and not every possible permutation.
l
Tagging — Identifying the part of speech represented by each word, such as noun, verb, or
adjective.
l
Tokenizing — Segmenting sentences into words and phrases. This process determines which
words should be retained as phrases, and which ones should be segmented into individual
words. For example, "Type II Diabetes" should be retained as a word phrase, whereas "A
patient with diabetes" would be segmented into four separate words.
l
Core Terms — Significant terms, such as protein names and experimental method names,
are identified, based on a dictionary of core terms. A related process is ignoring insignificant
words, such as "the", "and", and "a".
l
Resolving Abbreviations, Acronyms, and Synonyms — Replacing abbreviations with the
words they represent, and resolving acronyms and synonyms to a controlled vocabulary. For
example, "DM" and "Diabetes Mellitus" could be resolved to "Type II Diabetes", depending on
the controlled vocabulary.
l
Search WWH ::




Custom Search