quires a morphological analysis of the words in their domain, given the nature
of German, their target language. They explore the use of off-the-shelf POS
taggers and morphological analyzers for this purpose, but find them falling
short in their domain (a technical one, electrical fault diagnosis), and have
to resort to hand-coding the morphological rules. Two other NLP resources
that they utilize are FrameNet and VerbNet, which help them find relevant verbs and
relationships to map into their knowledge-engineering categories, but this is
used off-line for analysis rather than in on-line processing. Finally, they use
active learning, a statistical technique that is relatively new to text mining
(or data mining in general, for that matter), to train their classifiers efficiently.
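The article gives few implementation details of the active-learning step, but its basic loop is easy to sketch. The example below assumes a pool-based setting with uncertainty sampling and an off-the-shelf logistic regression classifier; the data, the classifier, the seed set, and the query budget are illustrative choices, not the authors' actual setup.

    # Sketch of pool-based active learning with uncertainty sampling.
    # Data, classifier, seed set, and query budget are illustrative only.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    # Seed the labeled set with a few examples from each class.
    labeled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])
    pool = [i for i in range(len(y)) if i not in labeled]

    clf = LogisticRegression(max_iter=1000)
    for _ in range(5):  # five query rounds
        clf.fit(X[labeled], y[labeled])
        probs = clf.predict_proba(X[pool])
        # Query the pool examples the current model is least certain about.
        uncertainty = 1.0 - probs.max(axis=1)
        queries = np.argsort(uncertainty)[-10:]
        for q in sorted(queries, reverse=True):
            labeled.append(pool.pop(q))  # the "oracle" supplies these labels

In each round the examples whose predicted class probabilities are closest to chance are moved from the unlabeled pool into the training set, which is the sense in which active learning spends labeling effort efficiently.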
Marchisio et al. utilize NLP techniques almost exclusively, writing their
own parser to do full parsing and using their novel indexing technique to
compress complex parse forests in a way that captures basic dependency rela-
tions like subject-of, object-of, and verb-modification like time, location, etc.,
as well as extended relations involving the modifiers of the entities involved
in the basic relations or other entities associated with them in the text or in
background knowledge. The index allows them to rapidly access all of these
relations, permitting them to be used in document search, an area that has
long been considered not to derive any benefit from any but surface NLP
techniques like tokenization and stemming. This entails a whole new protocol
for search, however, and the focus of their article is on how well users adapt
to this new protocol.
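The indexing scheme itself is not described in enough detail here to reproduce, but the general idea of indexing dependency relations for search can be conveyed with a toy inverted index over relation triples. The triples, relation names, and lookup function below are hypothetical stand-ins for what their parser and index actually produce.

    # Toy inverted index over dependency relations: documents are keyed by
    # (relation, head, dependent) so that queries such as "documents where
    # 'engineer' is the subject of 'repair'" can be answered directly.
    # The triples are hand-written stand-ins for parser output.
    from collections import defaultdict

    triples = [
        ("doc1", "subject-of", "repair", "engineer"),
        ("doc1", "object-of", "repair", "transformer"),
        ("doc2", "subject-of", "inspect", "technician"),
        ("doc2", "modifier", "inspect", "yesterday"),  # time modification
    ]

    index = defaultdict(set)
    for doc_id, rel, head, dep in triples:
        index[(rel, head, dep)].add(doc_id)   # exact relation lookup
        index[(rel, head, None)].add(doc_id)  # wildcard over the dependent

    def search(rel, head, dep=None):
        """Return the documents containing the requested relation."""
        return index.get((rel, head, dep), set())

    print(search("subject-of", "repair", "engineer"))  # {'doc1'}
    print(search("object-of", "repair"))               # {'doc1'}

The point of such a structure is that a relational query never has to re-parse or re-scan the documents; it is a direct lookup, which is what makes relation-based document search fast enough to be practical.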
1.3 Non-NLP Techniques
Boontham et al. discuss the use of three different approaches to categoriz-
ing the free text responses of students to open-ended questions: simple word
matching, Latent Semantic Analysis (LSA), and a variation on LSA which
they call Topic Models. LSA and Topic Models are both numerical methods
that start from a bag-of-words representation of the text and use linear
algebra to generate new features. In addition, they use
discriminant analysis from statistics for classification. Stemming and soundex
(a method for correcting misspellings by representing words in a way that
roughly corresponds to their pronunciation) are used in the word matching
component. Stemming is the only NLP technique used.
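As a rough illustration of the bag-of-words starting point that LSA builds on, the sketch below maps a handful of invented student responses into a low-dimensional latent space using scikit-learn's TF-IDF vectorizer and truncated SVD; the responses and the choice of two components are assumptions made only for the example, not the authors' configuration.

    # Minimal LSA sketch: bag of words -> TF-IDF -> truncated SVD.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    responses = [
        "the heart pumps blood through the body",
        "blood carries oxygen to the cells",
        "the heart is a muscle that pumps blood",
        "plants make food using sunlight",
    ]

    bow = TfidfVectorizer().fit_transform(responses)    # bag-of-words matrix
    lsa = TruncatedSVD(n_components=2, random_state=0)  # latent semantic space
    vectors = lsa.fit_transform(bow)

    # Each response is now a dense two-dimensional vector; responses about
    # the same topic (the first and third) land close together.
    print(vectors)

A classifier, such as the discriminant analysis the authors mention, would then operate on these dense vectors rather than on raw word counts.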
McCarthy et al. also use LSA as their primary technique, employing it to
compare different sections of a document rather than whole documents, and to
develop a “signature” of a document based on the correlations between its
sections.
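One plausible way to realize such a signature, offered only as a sketch and not as the authors' implementation, is to project each section into an LSA space and take the pairwise similarities between sections as a fixed-length description of the document; the sections below are invented for the example.

    # Section-based "signature": LSA vectors per section, then the pairwise
    # cosine similarities between sections. The sections are made up.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    sections = [
        "introduction stating the problem and the goals of the study",
        "methods describing the corpus and the statistical analysis",
        "results reporting the accuracy of the statistical analysis",
    ]

    vectors = TruncatedSVD(n_components=2, random_state=0).fit_transform(
        TfidfVectorizer().fit_transform(sections)
    )

    # The upper triangle of this matrix (section-to-section similarities)
    # can serve as the document's signature.
    print(cosine_similarity(vectors))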
Schmidtler and Amtrup combine an SVM with a Markov chain to de-
termine how to separate sequences of text pages into distinct documents of
different types given that the text pages are very noisy, being the product
of optical character recognition. They do a nice job of exploring the different
ways they might model a sequence of pages, in terms both of what categories