Chapter 2. Finding Parts of Text
Finding parts of text is concerned with breaking text down into individual units called
tokens, and optionally performing additional processing on these tokens. This additional
processing can include stemming, lemmatization, stopword removal, synonym expansion,
and converting text to lowercase.
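As a minimal sketch of such a pipeline in plain Java, the following tokenizes a string with a regular expression and then lowercases the tokens and removes stopwords; the splitting pattern and the tiny stopword list are illustrative choices, not part of any NLP API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SimplePipeline {
    public static void main(String[] args) {
        String text = "The cat sat on the mat.";
        // An illustrative stopword list; real applications use much larger lists
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "on", "a"));
        List<String> tokens = new ArrayList<>();
        // Split on runs of non-letter characters (a simplistic tokenization rule)
        for (String token : text.split("[^a-zA-Z]+")) {
            String normalized = token.toLowerCase();   // convert to lowercase
            if (!normalized.isEmpty() && !stopwords.contains(normalized)) {
                tokens.add(normalized);                // keep non-stopword tokens
            }
        }
        System.out.println(tokens);   // prints [cat, sat, mat]
    }
}
```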
We will demonstrate several tokenization techniques found in the standard Java distribution. Sometimes these are all you need to do the job, with no NLP libraries to import. However, these techniques are limited. We then discuss the specific tokenizers and tokenization approaches supported by NLP APIs; these examples will serve as a reference for how the tokenizers are used and the kind of output they produce. We conclude with a simple comparison of the differences between the approaches.
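For instance, the java.text.BreakIterator class, part of the standard distribution, locates word boundaries without any third-party dependency. A minimal sketch follows; filtering out the whitespace and punctuation boundaries it reports is our own illustrative choice:

```java
import java.text.BreakIterator;

public class BreakIteratorExample {
    public static void main(String[] args) {
        String text = "Let's pause, and then reflect.";
        // BreakIterator is part of the standard java.text package
        BreakIterator wordIterator = BreakIterator.getWordInstance();
        wordIterator.setText(text);
        int start = wordIterator.first();
        for (int end = wordIterator.next(); end != BreakIterator.DONE;
                start = end, end = wordIterator.next()) {
            String token = text.substring(start, end);
            // Skip the whitespace and punctuation spans the iterator reports
            if (Character.isLetterOrDigit(token.charAt(0))) {
                System.out.println(token);   // Let's, pause, and, then, reflect
            }
        }
    }
}
```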
There are many specialized tokenizers. For example, the Apache Lucene project supports tokenizers for various languages and specialized documents: the WikipediaTokenizer class handles Wikipedia-specific documents, and the ArabicAnalyzer class handles Arabic text. It is not possible to illustrate all of these varying approaches here.
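That said, a hedged sketch of how such a Lucene tokenizer is typically driven may still be useful. The attribute-based TokenStream loop below assumes a reasonably recent Lucene release, where Tokenizer has a no-argument constructor and a setReader() method:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.wikipedia.WikipediaTokenizer;

public class WikipediaTokenizerExample {
    public static void main(String[] args) throws Exception {
        String markup = "[[Apache Lucene]] is a search library.";
        try (WikipediaTokenizer tokenizer = new WikipediaTokenizer()) {
            tokenizer.setReader(new StringReader(markup));
            // The attribute holds the text of the current token
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                System.out.println(term.toString());   // one token per line
            }
            tokenizer.end();
        }
    }
}
```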
We will also examine how certain tokenizers can be trained to handle specialized text. This is useful when an unusual form of text is encountered, and it can often eliminate the need to write a new, specialized tokenizer.
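As one concrete case, OpenNLP's TokenizerME can be trained on annotated text in which token boundaries are marked with &lt;SPLIT&gt; tags. The sketch below assumes OpenNLP 1.6 or later; the training.txt file name is hypothetical:

```java
import java.io.File;
import java.nio.charset.StandardCharsets;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerFactory;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainTokenizer {
    public static void main(String[] args) throws Exception {
        // training.txt is a hypothetical file: one sentence per line, with
        // <SPLIT> marking token boundaries that are not separated by spaces
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("training.txt")),
                StandardCharsets.UTF_8);
        try (ObjectStream<TokenSample> samples = new TokenSampleStream(lines)) {
            TokenizerModel model = TokenizerME.train(
                    samples,
                    new TokenizerFactory("en", null, false, null),
                    TrainingParameters.defaultParams());
            // Tokenize new text with the freshly trained model
            TokenizerME tokenizer = new TokenizerME(model);
            for (String token : tokenizer.tokenize("Mr. Smith's dog barked.")) {
                System.out.println(token);
            }
        }
    }
}
```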
Next, we will illustrate how some of these tokenizers can be used to support specific operations such as stemming, lemmatization, and stopword removal. Part-of-speech (POS) tagging can also be considered a special instance of finding parts of text; however, that topic is investigated in Chapter 5, Detecting Parts of Speech.
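As a small preview of one of those operations, the following sketch stems a few words with the PorterStemmer class bundled with recent OpenNLP releases; it is one stemmer implementation among many:

```java
import opennlp.tools.stemmer.PorterStemmer;

public class StemmingPreview {
    public static void main(String[] args) {
        PorterStemmer stemmer = new PorterStemmer();
        for (String word : new String[] {"bank", "banking", "banks", "banked"}) {
            // stem() reduces each word to its Porter stem, here "bank"
            System.out.println(word + " -> " + stemmer.stem(word));
        }
    }
}
```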