Chapter 2. Finding Parts of Text
Finding parts of text is concerned with breaking text down into individual units called
tokens, and optionally performing additional processing on these tokens. This additional
processing can include stemming, lemmatization, stopword removal, synonym expansion,
and converting text to lowercase.
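As a minimal sketch of such a pipeline in plain Java, the following tokenizes a string with a regular expression and then lowercases the tokens and removes stopwords; the splitting pattern and the tiny stopword list are illustrative choices, not part of any NLP API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SimplePipeline {
    public static void main(String[] args) {
        String text = "The cat sat on the mat.";
        // An illustrative stopword list; real applications use much larger lists
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "on", "a"));
        List<String> tokens = new ArrayList<>();
        // Split on runs of non-letter characters (a simplistic tokenization rule)
        for (String token : text.split("[^a-zA-Z]+")) {
            String normalized = token.toLowerCase();   // convert to lowercase
            if (!normalized.isEmpty() && !stopwords.contains(normalized)) {
                tokens.add(normalized);                // keep non-stopword tokens
            }
        }
        System.out.println(tokens);   // prints [cat, sat, mat]
    }
}
```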
We will demonstrate several tokenization techniques found in the standard Java distribution. Sometimes these are all you need to do the job, with no NLP libraries to import. However, these techniques are limited. We then discuss the specific tokenizers and tokenization approaches supported by NLP APIs; these examples will serve as a reference for how the tokenizers are used and the kind of output they produce. We conclude with a simple comparison of the differences between the approaches.
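For instance, the java.text.BreakIterator class, part of the standard distribution, locates word boundaries without any third-party dependency. A minimal sketch follows; filtering out the whitespace and punctuation boundaries it reports is our own illustrative choice:

```java
import java.text.BreakIterator;

public class BreakIteratorExample {
    public static void main(String[] args) {
        String text = "Let's pause, and then reflect.";
        // BreakIterator is part of the standard java.text package
        BreakIterator wordIterator = BreakIterator.getWordInstance();
        wordIterator.setText(text);
        int start = wordIterator.first();
        for (int end = wordIterator.next(); end != BreakIterator.DONE;
                start = end, end = wordIterator.next()) {
            String token = text.substring(start, end);
            // Skip the whitespace and punctuation spans the iterator reports
            if (Character.isLetterOrDigit(token.charAt(0))) {
                System.out.println(token);   // Let's, pause, and, then, reflect
            }
        }
    }
}
```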
There are many specialized tokenizers. For example, the Apache Lucene project supports tokenizers for various languages and specialized documents: the WikipediaTokenizer class handles Wikipedia-specific documents, and the ArabicAnalyzer class handles Arabic text. It is not possible to illustrate all of these varying approaches here.
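That said, a hedged sketch of how such a Lucene tokenizer is typically driven may still be useful. The attribute-based TokenStream loop below assumes a reasonably recent Lucene release, where Tokenizer has a no-argument constructor and a setReader() method:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.wikipedia.WikipediaTokenizer;

public class WikipediaTokenizerExample {
    public static void main(String[] args) throws Exception {
        String markup = "[[Apache Lucene]] is a search library.";
        try (WikipediaTokenizer tokenizer = new WikipediaTokenizer()) {
            tokenizer.setReader(new StringReader(markup));
            // The attribute holds the text of the current token
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                System.out.println(term.toString());   // one token per line
            }
            tokenizer.end();
        }
    }
}
```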
We will also examine how certain tokenizers can be trained to handle specialized text. This is useful when an unusual form of text is encountered, and it can often eliminate the need to write a new, specialized tokenizer.
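As one concrete case, OpenNLP's TokenizerME can be trained on annotated text in which token boundaries are marked with &lt;SPLIT&gt; tags. The sketch below assumes OpenNLP 1.6 or later; the training.txt file name is hypothetical:

```java
import java.io.File;
import java.nio.charset.StandardCharsets;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerFactory;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainTokenizer {
    public static void main(String[] args) throws Exception {
        // training.txt is a hypothetical file: one sentence per line, with
        // <SPLIT> marking token boundaries that are not separated by spaces
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("training.txt")),
                StandardCharsets.UTF_8);
        try (ObjectStream<TokenSample> samples = new TokenSampleStream(lines)) {
            TokenizerModel model = TokenizerME.train(
                    samples,
                    new TokenizerFactory("en", null, false, null),
                    TrainingParameters.defaultParams());
            // Tokenize new text with the freshly trained model
            TokenizerME tokenizer = new TokenizerME(model);
            for (String token : tokenizer.tokenize("Mr. Smith's dog barked.")) {
                System.out.println(token);
            }
        }
    }
}
```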
Next, we will illustrate how some of these tokenizers can be used to support specific operations such as stemming, lemmatization, and stopword removal. Part-of-speech (POS) tagging can also be considered a special instance of finding parts of text; however, that topic is investigated in Chapter 5, Detecting Parts of Speech.
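As a small preview of one of those operations, the following sketch stems a few words with the PorterStemmer class bundled with recent OpenNLP releases; it is one stemmer implementation among many:

```java
import opennlp.tools.stemmer.PorterStemmer;

public class StemmingPreview {
    public static void main(String[] args) {
        PorterStemmer stemmer = new PorterStemmer();
        for (String word : new String[] {"bank", "banking", "banks", "banked"}) {
            // stem() reduces each word to its Porter stem, here "bank"
            System.out.println(word + " -> " + stemmer.stem(word));
        }
    }
}
```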