The analysis phase of NLP typically involves the use of heuristics, grammar, or statistical methods.
Heuristic approaches rely on a knowledge base of rules that are applied to the processed text.
Grammar-based methods use language models to extract information from the processed text.
Statistical methods use mathematical models to derive context and meaning from words. These
methods are often combined in the same system; for example, grammar-based and statistical
methods are frequently paired in NLP systems to achieve better performance than either
approach could alone.
Heuristic or rule-based analysis uses IF-THEN rules on the processed words and sentences to infer
association or meaning. Consider the following rule:
IF <protein name>
AND <experimental method name> are in the same sentence
THEN the <experimental method name> refers to the <protein name>
This rule states that if a protein name, such as "hemoglobin", is in the same sentence as an
experimental method, such as "microarray spotting", then microarray spotting refers to hemoglobin.
One obvious problem with heuristic methods is that there are exceptions to most rules. For example,
applying the preceding rule to a sentence beginning with "Microarray spotting was not used on the
hemoglobin molecule because…" would improperly evaluate the sentence.
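The co-occurrence rule above, and the way negation defeats it, can be sketched in a few lines. The vocabularies and the `apply_rule` helper are hypothetical stand-ins; a real system would draw protein and method names from curated ontologies.

```python
# Hypothetical vocabularies; a real system would use curated ontologies.
PROTEINS = {"hemoglobin", "myoglobin"}
METHODS = {"microarray spotting", "mass spectrometry"}

def apply_rule(sentence):
    """IF a protein name AND a method name occur in the same sentence,
    THEN associate the method with the protein (naive co-occurrence)."""
    text = sentence.lower()
    return [(m, p) for p in PROTEINS for m in METHODS
            if p in text and m in text]

# The rule fires correctly here...
print(apply_rule("Microarray spotting identified hemoglobin variants."))
# ...but it also fires on a negated sentence, illustrating the
# exception problem described above.
print(apply_rule("Microarray spotting was not used on the hemoglobin molecule."))
```

Both calls return the same association, even though the second sentence explicitly denies it; handling such exceptions requires additional rules (e.g., negation detection), which is why rule bases tend to grow large.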
Grammar-based methods use language models that serve as templates for the sentence- and phrase-
level analysis. These templates tend to be domain-specific. For example, a typical patient case report
submitted by a clinician might read:
"The patient was a 45-year-old white male with a chief complaint of abdominal pain
for three days."
A template that would be compatible with the sentence is:
<patient> <patient age> <race> <sex> <chief complaint> <complaint duration>
Templates tend to work better in clinical publications than they do in basic research publications
because much of physician education involves learning a strict method of reporting clinical findings.
However, scientists involved in basic research tend to have less indoctrination in a particular way of
revealing their findings, and so the statement of findings doesn't follow a syntactic formula.
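One simple way to implement such a template is as a regular expression with named slots, one per template element. The pattern below is a hypothetical rendering of the clinical template above, not a production grammar.

```python
import re

# Hypothetical regex rendering of the template:
# <patient> <age> <race> <sex> <chief complaint> <duration>
TEMPLATE = re.compile(
    r"The patient was a (?P<age>\d+)-year-old (?P<race>\w+) (?P<sex>male|female) "
    r"with a chief complaint of (?P<complaint>.+?) for (?P<duration>.+?)\."
)

sentence = ("The patient was a 45-year-old white male with a chief "
            "complaint of abdominal pain for three days.")
match = TEMPLATE.search(sentence)
if match:
    print(match.groupdict())
# {'age': '45', 'race': 'white', 'sex': 'male',
#  'complaint': 'abdominal pain', 'duration': 'three days'}
```

The template succeeds on formulaic clinical prose but would fail on a freely worded research sentence, which is the limitation noted above.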
Most statistical approaches to the analysis phase of NLP include an assessment of word frequency at the
sentence, paragraph, and document level. Word frequency is relevant because words with the lowest
frequency of occurrence tend to have the greatest meaning and significance in a document.
Conversely, words with the highest frequency of occurrence, such as "and", "the", and "a", have
relatively little meaning.
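The frequency pattern described above is easy to demonstrate with a word count over a short passage; the sample text here is illustrative.

```python
from collections import Counter

text = ("The analysis phase of NLP typically involves the use of heuristics. "
        "Statistical methods use mathematical models to derive meaning. "
        "The methods are often combined in the same system.")

# Count words, ignoring case and sentence-ending periods.
freq = Counter(text.lower().replace(".", "").split())

# High-frequency function words ("the", "of") carry little meaning...
print(freq.most_common(3))

# ...while the rarest words tend to be the content-bearing ones.
rare = sorted(w for w, count in freq.items() if count == 1)
print(rare[:5])
```

Real systems typically discard the high-frequency function words as a "stop list" before any further statistical analysis.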
In one statistical approach based on word frequency, a document is represented as a vector of word
frequency, with the individual words or phrases forming the axes of the multi-dimensional space. This
vector can be compared to a library of standard vectors, each of which represents a particular
concept. Because the closeness of the two vectors represents similarity in concepts or at least
content, this method can be used to automatically classify the contents of the document under
analysis. For example, in Figure 7-17, a document represented by a vector is compared with a vector
that represents the use of microarray spotting of the hemoglobin extracted from patients with sickle-
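The vector comparison described above is commonly implemented as cosine similarity between word-frequency vectors. The vocabulary and "standard" concept vectors below are hypothetical stand-ins for a real reference library.

```python
import math
from collections import Counter

def vectorize(text, vocab):
    """Represent a document as word-frequency counts over a fixed vocabulary,
    so each vocabulary word forms one axis of the space."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

def cosine(u, v):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocab = ["microarray", "spotting", "hemoglobin", "sickle", "asthma"]

# Hypothetical library of standard vectors, one per concept.
concepts = {
    "sickle-cell methods": vectorize("microarray spotting hemoglobin sickle", vocab),
    "asthma studies": vectorize("asthma asthma inhaler", vocab),
}

doc = "microarray spotting of hemoglobin from sickle patients"
doc_vec = vectorize(doc, vocab)

# Classify the document by its closest standard vector.
best = max(concepts, key=lambda name: cosine(doc_vec, concepts[name]))
print(best)  # sickle-cell methods
```

The document is assigned to whichever concept vector it is closest to, which is how this approach supports automatic classification.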