quires a morphological analysis of the words in their domain, given the nature
of German, their target language. They explore the use of off-the-shelf POS
taggers and morphological analyzers for this purpose, but find them falling
short in their domain (a technical one, electrical fault diagnosis), and have
to resort to hand-coding the morphological rules. Two other NLP resources
that they utilize are FrameNet and VerbNet, which help them find relevant verbs and
relationships to map into their knowledge-engineering categories, but this is
used off-line for analysis rather than in on-line processing. Finally, they use
active learning, a statistical technique that is relatively new to text mining
(or data mining in general, for that matter), to train their classifiers efficiently.
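The article gives few implementation details of the active-learning step, but its basic loop is easy to sketch. The example below assumes a pool-based setting with uncertainty sampling and an off-the-shelf logistic regression classifier; the data, the classifier, the seed set, and the query budget are illustrative choices, not the authors' actual setup.

    # Sketch of pool-based active learning with uncertainty sampling.
    # Data, classifier, seed set, and query budget are illustrative only.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    # Seed the labeled set with a few examples from each class.
    labeled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])
    pool = [i for i in range(len(y)) if i not in labeled]

    clf = LogisticRegression(max_iter=1000)
    for _ in range(5):  # five query rounds
        clf.fit(X[labeled], y[labeled])
        probs = clf.predict_proba(X[pool])
        # Query the pool examples the current model is least certain about.
        uncertainty = 1.0 - probs.max(axis=1)
        queries = np.argsort(uncertainty)[-10:]
        for q in sorted(queries, reverse=True):
            labeled.append(pool.pop(q))  # the "oracle" supplies these labels

In each round the examples whose predicted class probabilities are closest to chance are moved from the unlabeled pool into the training set, which is the sense in which active learning spends labeling effort efficiently.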
Marchisio et al. utilize NLP techniques almost exclusively, writing their
own parser to do full parsing and using their novel indexing technique to
compress complex parse forests in a way that captures basic dependency rela-
tions like subject-of, object-of, and verb-modification like time, location, etc.,
as well as extended relations involving the modifiers of the entities involved
in the basic relations or other entities associated with them in the text or in
background knowledge. The index allows them to rapidly access all of these
relations, permitting them to be used in document search, an area that has
long been considered not to derive any benefit from any but surface NLP
techniques like tokenization and stemming. This entails a whole new protocol
for search, however, and the focus of their article is on how well users adapt
to this new protocol.
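The indexing scheme itself is not described in enough detail here to reproduce, but the general idea of indexing dependency relations for search can be conveyed with a toy inverted index over relation triples. The triples, relation names, and lookup function below are hypothetical stand-ins for what their parser and index actually produce.

    # Toy inverted index over dependency relations: documents are keyed by
    # (relation, head, dependent) so that queries such as "documents where
    # 'engineer' is the subject of 'repair'" can be answered directly.
    # The triples are hand-written stand-ins for parser output.
    from collections import defaultdict

    triples = [
        ("doc1", "subject-of", "repair", "engineer"),
        ("doc1", "object-of", "repair", "transformer"),
        ("doc2", "subject-of", "inspect", "technician"),
        ("doc2", "modifier", "inspect", "yesterday"),  # time modification
    ]

    index = defaultdict(set)
    for doc_id, rel, head, dep in triples:
        index[(rel, head, dep)].add(doc_id)   # exact relation lookup
        index[(rel, head, None)].add(doc_id)  # wildcard over the dependent

    def search(rel, head, dep=None):
        """Return the documents containing the requested relation."""
        return index.get((rel, head, dep), set())

    print(search("subject-of", "repair", "engineer"))  # {'doc1'}
    print(search("object-of", "repair"))               # {'doc1'}

The point of such a structure is that a relational query never has to re-parse or re-scan the documents; it is a direct lookup, which is what makes relation-based document search fast enough to be practical.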
1.3 Non-NLP Techniques
Boontham et al. discuss the use of three different approaches to categoriz-
ing the free text responses of students to open-ended questions: simple word
matching, Latent Semantic Analysis (LSA), and a variation on LSA which
they call Topic Models. LSA and Topic Models are both numerical methods
that start from a bag-of-words representation of the text and use linear
algebra to generate new features. In addition, they use
discriminant analysis from statistics for classification. Stemming and soundex
(a method for correcting misspellings by representing words in a way that
roughly corresponds to their pronunciation) are used in the word matching
component. Stemming is the only NLP technique used.
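As a rough illustration of the bag-of-words starting point that LSA builds on, the sketch below maps a handful of invented student responses into a low-dimensional latent space using scikit-learn's TF-IDF vectorizer and truncated SVD; the responses and the choice of two components are assumptions made only for the example, not the authors' configuration.

    # Minimal LSA sketch: bag of words -> TF-IDF -> truncated SVD.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    responses = [
        "the heart pumps blood through the body",
        "blood carries oxygen to the cells",
        "the heart is a muscle that pumps blood",
        "plants make food using sunlight",
    ]

    bow = TfidfVectorizer().fit_transform(responses)    # bag-of-words matrix
    lsa = TruncatedSVD(n_components=2, random_state=0)  # latent semantic space
    vectors = lsa.fit_transform(bow)

    # Each response is now a dense two-dimensional vector; responses about
    # the same topic (the first and third) land close together.
    print(vectors)

A classifier, such as the discriminant analysis the authors mention, would then operate on these dense vectors rather than on raw word counts.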
McCarthy et al. also use LSA as their primary technique, employing it to
compare different sections of a document rather than whole documents, and to
develop a “signature” of a document based on the correlations between its
sections.
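One plausible way to realize such a signature, offered only as a sketch and not as the authors' implementation, is to project each section into an LSA space and take the pairwise similarities between sections as a fixed-length description of the document; the sections below are invented for the example.

    # Section-based "signature": LSA vectors per section, then the pairwise
    # cosine similarities between sections. The sections are made up.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    sections = [
        "introduction stating the problem and the goals of the study",
        "methods describing the corpus and the statistical analysis",
        "results reporting the accuracy of the statistical analysis",
    ]

    vectors = TruncatedSVD(n_components=2, random_state=0).fit_transform(
        TfidfVectorizer().fit_transform(sections)
    )

    # The upper triangle of this matrix (section-to-section similarities)
    # can serve as the document's signature.
    print(cosine_similarity(vectors))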
Schmidtler and Amtrup combine an SVM with a Markov chain to de-
termine how to separate sequences of text pages into distinct documents of
different types given that the text pages are very noisy, being the product
of optical character recognition. They do a nice job of exploring the different
ways they might model a sequence of pages, in terms both of what categories