Finally, the chapter by Schmitt et al. is rather different from the other
chapters, being more of a tutorial that can benefit students and seasoned
professionals alike. It shows how to construct a broad range of text mining
and NLP tools using simple UNIX commands together with sed and awk (and
provides an excellent primer on these in the process). These tools can be used
to perform a number of functions, from quite basic ones such as tokenization,
stemming, or synonym replacement, which are fundamental to many applications,
to more complex or specialized ones, such as constructing a concordance (a
list of terms in context drawn from a corpus, i.e., a set of documents used
for training or analysis) or merging text from different formats to capture
the important information from each while eliminating irrelevant notation
(e.g., discarding irrelevant formatting mark-up while retaining information
about both the pronunciation and the kanji forms of Japanese characters).
This material is useful not only for people working on UNIX (or Linux); it
can also be adapted fairly easily to Perl, which shares much of the regular
expression syntax of the UNIX tools sed and awk.
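To give a flavour of such pipelines, a minimal sketch follows. It builds a
word-frequency list and a crude keyword-in-context concordance from a
plain-text file; the file name corpus.txt and the keyword "mining" are
illustrative assumptions rather than examples drawn from the chapter itself.

    # Tokenize: one lowercase word per line, then count word frequencies
    tr -sc 'A-Za-z' '\n' < corpus.txt | tr 'A-Z' 'a-z' | sort | uniq -c | sort -rn > freq.txt

    # Crude concordance: print each occurrence of the keyword with up to
    # three words of context on either side (context limited to one line)
    awk -v kw="mining" '{
        for (i = 1; i <= NF; i++)
            if (tolower($i) == kw) {
                out = ""
                for (j = (i > 3 ? i - 3 : 1); j <= i + 3 && j <= NF; j++)
                    out = out $j " "
                print out
            }
    }' corpus.txt

A sed step could be inserted before the counting stage to map inflected or
synonymous forms onto a canonical term, which is roughly where the stemming
and synonym-replacement operations mentioned above would fit in such a
pipeline.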
1.6 Future Work
With the growing use of the Internet, text mining has become increasingly
important since the term came into popular usage over ten years ago. Highly
related and specialized fields such as web mining and bioinformatics have also
attracted a great deal of research. However, more work is still needed in
several major directions. (1) Data mining practitioners widely agree that the
majority of data mining work lies in data cleaning and data preparation. This
is perhaps even more true of text mining. Much text data does not follow
prescriptive spelling, grammar, or style rules. For example, the language used
in maintenance data, help desk reports, blogs, or email does not resemble that
of well-edited news articles at all. More studies on how, and to what degree,
the quality of text data affects different types of text mining algorithms, as
well as better methods for preprocessing text data, would be very beneficial.
(2) Practitioners of text mining are rarely sure whether an algorithm
demonstrated to be effective on one type of data will work on another.
Standard test data sets can help compare different algorithms, but they can
never tell us whether an algorithm that performs well on them will perform
well on a particular user's dataset. While establishing a fully articulated
natural language model for each genre of text data is likely an unreachable
goal, it would be extremely useful if researchers could show which types of
algorithms and parameter settings tend to work well on which types of text
data, based on relatively easily ascertained characteristics of the data
(e.g., technical vs. non-technical, edited vs. non-edited, short news items
vs. long articles, the proportion of unknown vs. known words or of jargon vs.
general words, complete, well-punctuated sentences vs. a series of phrases
with little or no punctuation, etc.). (3) The range of text mining
applications is now far broader than