Finally, the chapter by Schmitt et al. is rather different from the other
chapters, being more of a tutorial that can benefit students and seasoned
professionals alike. It shows how to construct a broad range of text mining
and NLP tools using simple UNIX commands together with sed and awk (and
provides an excellent primer on these in the process). These tools can be used
to perform a number of functions, from quite basic ones such as tokenization,
stemming, or synonym replacement, which are fundamental to many applications,
to more complex or specialized ones, such as constructing a concordance (a
list of terms in context drawn from a corpus, i.e., a set of documents used
for training or analysis) or merging text from different formats to capture
the important information from each while eliminating irrelevant notation
(e.g., discarding irrelevant formatting mark-up while retaining information
about both the pronunciation and the kanji forms of Japanese characters).
This material is useful not only for people working on UNIX (or Linux); it
can also be adapted fairly easily to Perl, which shares much of the regular
expression syntax of the UNIX tools sed and awk.
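To give a flavour of such pipelines, a minimal sketch follows. It builds a
word-frequency list and a crude keyword-in-context concordance from a
plain-text file; the file name corpus.txt and the keyword "mining" are
illustrative assumptions rather than examples drawn from the chapter itself.

    # Tokenize: one lowercase word per line, then count word frequencies
    tr -sc 'A-Za-z' '\n' < corpus.txt | tr 'A-Z' 'a-z' | sort | uniq -c | sort -rn > freq.txt

    # Crude concordance: print each occurrence of the keyword with up to
    # three words of context on either side (context limited to one line)
    awk -v kw="mining" '{
        for (i = 1; i <= NF; i++)
            if (tolower($i) == kw) {
                out = ""
                for (j = (i > 3 ? i - 3 : 1); j <= i + 3 && j <= NF; j++)
                    out = out $j " "
                print out
            }
    }' corpus.txt

A sed step could be inserted before the counting stage to map inflected or
synonymous forms onto a canonical term, which is roughly where the stemming
and synonym-replacement operations mentioned above would fit in such a
pipeline.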
1.6 Future Work
With the growing use of the Internet, text mining has become increasingly
important since the term came into popular usage over ten years ago. Highly
related and specialized fields such as web mining and bioinformatics have also
attracted a great deal of research. However, more work is still needed in
several major directions. (1) Data mining practitioners widely agree that the
majority of data mining work lies in data cleaning and data preparation. This
is perhaps even more true of text mining. Much text data does not follow
prescriptive spelling, grammar, or style rules. For example, the language used
in maintenance data, help desk reports, blogs, or email does not resemble that
of well-edited news articles at all. More studies on how, and to what degree,
the quality of text data affects different types of text mining algorithms, as
well as better methods for preprocessing text data, would be very beneficial.
(2) Practitioners of text mining are rarely sure whether an algorithm
demonstrated to be effective on one type of data will work on another.
Standard test data sets can help compare different algorithms, but they can
never tell us whether an algorithm that performs well on them will perform
well on a particular user's dataset. While establishing a fully articulated
natural language model for each genre of text data is likely an unreachable
goal, it would be extremely useful if researchers could show which types of
algorithms and parameter settings tend to work well on which types of text
data, based on relatively easily ascertained characteristics of the data
(e.g., technical vs. non-technical, edited vs. non-edited, short news items
vs. long articles, the proportion of unknown vs. known words or of jargon vs.
general words, complete, well-punctuated sentences vs. a series of phrases
with little or no punctuation, etc.). (3) The range of text mining
applications is now far broader than