The analysis phase of NLP typically involves the use of heuristics, grammar, or statistical methods.
Heuristic approaches rely on a knowledge base of rules that are applied to the processed text.
Grammar-based methods use language models to extract information from the processed text.
Statistical methods use mathematical models to derive context and meaning from words. These
methods are often combined in the same system; for example, grammar-based and statistical
methods are frequently paired in NLP systems to achieve better performance than either
approach could alone.
Heuristic or rule-based analysis uses IF-THEN rules on the processed words and sentences to infer
association or meaning. Consider the following rule:
IF <protein name>
AND <experimental method name> are in the same sentence
THEN the <experimental method name> refers to the <protein name>
This rule states that if a protein name, such as "hemoglobin", is in the same sentence as an
experimental method, such as "microarray spotting", then microarray spotting refers to hemoglobin.
One obvious problem with heuristic methods is that there are exceptions to most rules. For example,
applying the preceding rule to a sentence beginning with "Microarray spotting was not used on the
hemoglobin molecule because…" would improperly evaluate the sentence.
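The co-occurrence rule above, and the way negation defeats it, can be sketched in a few lines. The vocabularies and the `apply_rule` helper are hypothetical stand-ins; a real system would draw protein and method names from curated ontologies.

```python
# Hypothetical vocabularies; a real system would use curated ontologies.
PROTEINS = {"hemoglobin", "myoglobin"}
METHODS = {"microarray spotting", "mass spectrometry"}

def apply_rule(sentence):
    """IF a protein name AND a method name occur in the same sentence,
    THEN associate the method with the protein (naive co-occurrence)."""
    text = sentence.lower()
    return [(m, p) for p in PROTEINS for m in METHODS
            if p in text and m in text]

# The rule fires correctly here...
print(apply_rule("Microarray spotting identified hemoglobin variants."))
# ...but it also fires on a negated sentence, illustrating the
# exception problem described above.
print(apply_rule("Microarray spotting was not used on the hemoglobin molecule."))
```

Both calls return the same association, even though the second sentence explicitly denies it; handling such exceptions requires additional rules (e.g., negation detection), which is why rule bases tend to grow large.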
Grammar-based methods use language models that serve as templates for the sentence- and phrase-
level analysis. These templates tend to be domain-specific. For example, a typical patient case report
submitted by a clinician might read:
"The patient was a 45-year-old white male with a chief complaint of abdominal pain
for three days."
A template that would be compatible with the sentence is:
<patient> <patient age> <race> <sex> <chief complaint> <complaint duration>
Templates tend to work better in clinical publications than they do in basic research publications
because much of physician education involves learning a strict method of reporting clinical findings.
However, scientists involved in basic research tend to have less indoctrination in a particular way of
revealing their findings, and so the statement of findings doesn't follow a syntactic formula.
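One simple way to implement such a template is as a regular expression with named slots, one per template element. The pattern below is a hypothetical rendering of the clinical template above, not a production grammar.

```python
import re

# Hypothetical regex rendering of the template:
# <patient> <age> <race> <sex> <chief complaint> <duration>
TEMPLATE = re.compile(
    r"The patient was a (?P<age>\d+)-year-old (?P<race>\w+) (?P<sex>male|female) "
    r"with a chief complaint of (?P<complaint>.+?) for (?P<duration>.+?)\."
)

sentence = ("The patient was a 45-year-old white male with a chief "
            "complaint of abdominal pain for three days.")
match = TEMPLATE.search(sentence)
if match:
    print(match.groupdict())
# {'age': '45', 'race': 'white', 'sex': 'male',
#  'complaint': 'abdominal pain', 'duration': 'three days'}
```

The template succeeds on formulaic clinical prose but would fail on a freely worded research sentence, which is the limitation noted above.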
Most statistical approaches to the analysis phase of NLP include an assessment of word frequency at the
sentence, paragraph, and document level. Word frequency is relevant because words with the lowest
frequency of occurrence tend to have the greatest meaning and significance in a document.
Conversely, words with the highest frequency of occurrence, such as "and", "the", and "a", have
relatively little meaning.
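The frequency pattern described above is easy to demonstrate with a word count over a short passage; the sample text here is illustrative.

```python
from collections import Counter

text = ("The analysis phase of NLP typically involves the use of heuristics. "
        "Statistical methods use mathematical models to derive meaning. "
        "The methods are often combined in the same system.")

# Count words, ignoring case and sentence-ending periods.
freq = Counter(text.lower().replace(".", "").split())

# High-frequency function words ("the", "of") carry little meaning...
print(freq.most_common(3))

# ...while the rarest words tend to be the content-bearing ones.
rare = sorted(w for w, count in freq.items() if count == 1)
print(rare[:5])
```

Real systems typically discard the high-frequency function words as a "stop list" before any further statistical analysis.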
In one statistical approach based on word frequency, a document is represented as a vector of word
frequency, with the individual words or phrases forming the axes of the multi-dimensional space. This
vector can be compared to a library of standard vectors, each of which represents a particular
concept. Because the closeness of the two vectors represents similarity in concepts or at least
content, this method can be used to automatically classify the contents of the document under
analysis. For example, in Figure 7-17, a document represented by a vector is compared with a vector
that represents the use of microarray spotting of the hemoglobin extracted from patients with sickle-
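The vector comparison described above is commonly implemented as cosine similarity between word-frequency vectors. The vocabulary and "standard" concept vectors below are hypothetical stand-ins for a real reference library.

```python
import math
from collections import Counter

def vectorize(text, vocab):
    """Represent a document as word-frequency counts over a fixed vocabulary,
    so each vocabulary word forms one axis of the space."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

def cosine(u, v):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocab = ["microarray", "spotting", "hemoglobin", "sickle", "asthma"]

# Hypothetical library of standard vectors, one per concept.
concepts = {
    "sickle-cell methods": vectorize("microarray spotting hemoglobin sickle", vocab),
    "asthma studies": vectorize("asthma asthma inhaler", vocab),
}

doc = "microarray spotting of hemoglobin from sickle patients"
doc_vec = vectorize(doc, vocab)

# Classify the document by its closest standard vector.
best = max(concepts, key=lambda name: cosine(doc_vec, concepts[name]))
print(best)  # sickle-cell methods
```

The document is assigned to whichever concept vector it is closest to, which is how this approach supports automatic classification.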