one might assign to pages and how to combine page content and sequence
information. They use simple techniques like tokenization and stemming, but
not more complex NLP techniques.
Atkinson uses a technique that is very novel for text mining: genetic
algorithms (GAs). Genetic algorithms are typically used for solving problems
where the features can be represented as binary vectors. Atkinson adapts them
to text representations by employing a whole range of numerical and statistical
methods, including LSA and Markov chains, and various metrics built on
these. However, other than some manually constructed contexts for rhetorical
roles, he uses no true NLP techniques.
1.4 Range of Applications
The papers collected here address a wide range of applications, some more
traditional for text mining and some quite novel.
Marchisio et al. take a novel approach to a very traditional application,
simple search or document retrieval. They introduce a new paradigm that takes
advantage of the linguistic structure of the documents rather than relying on
keywords. Their end-user is the average user of a web search engine.
There are several variants on information extraction.
Bunescu and Mooney look at extracting relations, which, along with entity
extraction, is an important current research area in text mining. They focus
on two domains, bioinformatics and newspaper articles, each involving a
completely different set of entities and relations. The former involves entities like
genes, proteins, and cells, and relations like protein-protein interactions and
subcellular localization. The latter involves more familiar entities like people,
organizations, and locations, and relations like “belongs to,” “is head of,” etc.
Mustafaraj et al. focus on extracting a different kind of relation, the roles
of different entities relevant to diagnosis in the technical domain of electrical
engineering. These roles include things like “observed object,” “symptom,”
and “cause.” In the end, they are trying to mark up the text of diagnostic
reports in a way that facilitates search and the extraction of knowledge about
the domain.
Popescu and Etzioni's application is the extraction of product features,
parts, and attributes, and customers' or users' opinions about these (both
positive and negative, and how strongly they feel) from customer product
reviews. These include specialized entities and relations, as well as opinions
and their properties, which do not quite fit into these categories.
Atkinson ventures into another novel extraction paradigm, extracting
knowledge in the form of IF-THEN rules from scientific studies. The scientific
domain he focuses on in this particular study is agricultural and food science.
The remaining applications do not fit into any existing text mining niche
very well. Schmidtler et al. need to solve a very practical problem, that of
separating a stack of pages into distinct documents and labeling the document