Uses of tokenizers
The output of tokenization can be used for simple tasks such as spell checking and
processing simple searches. It is also useful for various downstream NLP tasks such as
identifying parts of speech (POS), detecting sentences, and classifying text. Most of the
chapters that follow involve tasks that require tokenization.
Frequently, tokenization is just one step in a larger sequence of tasks. These steps are
often chained together in a pipeline, as we will illustrate in Using a pipeline later in
this chapter. Because each later step consumes the tokenizer's output, the tokenizer must
produce quality results for the downstream task. If the tokenizer does a poor job, the
downstream task will be adversely affected.
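To make the dependency on tokenizer quality concrete, the following is a minimal sketch of a two-step pipeline: a tokenizer followed by a term-counting step. The class name TokenPipelineSketch, its methods, and the sample sentence are hypothetical and chosen only for illustration; they are not taken from any NLP library. The naive tokenizer leaves punctuation attached to words, so the downstream counter treats "cat" and "cat," as different terms.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration: a tokenizer feeding a downstream term counter.
public class TokenPipelineSketch {

    // Naive tokenizer: splits on whitespace only, so punctuation stays attached.
    public static List<String> naiveTokenize(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }

    // Slightly better tokenizer: splits on any run of characters
    // that are neither letters nor digits.
    public static List<String> betterTokenize(String text) {
        return Arrays.stream(text.split("[^\\p{L}\\p{N}]+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }

    // Downstream step: count how often each lowercased token occurs.
    public static Map<String, Integer> countTerms(List<String> tokens) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : tokens) {
            counts.merge(token.toLowerCase(), 1, Integer::sum);
        }
        return counts;
    }

    // Chain the chosen tokenizer with the counting step and run it on the text.
    public static Map<String, Integer> run(Function<String, List<String>> tokenizer,
                                           String text) {
        Function<String, Map<String, Integer>> pipeline =
                tokenizer.andThen(TokenPipelineSketch::countTerms);
        return pipeline.apply(text);
    }

    public static void main(String[] args) {
        String text = "The cat sat. The cat, again!";
        System.out.println(run(TokenPipelineSketch::naiveTokenize, text));
        System.out.println(run(TokenPipelineSketch::betterTokenize, text));
    }
}
```

With the naive tokenizer, "cat" is counted once and "cat," is counted separately; with the second tokenizer, "cat" is counted twice. The counting step is identical in both runs, so the difference comes entirely from the tokenizer.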
There are many different tokenizers and tokenization techniques available in Java. Several
core Java classes were designed to support tokenization, though some of these are now
outdated. There are also a number of NLP APIs designed to address both simple and complex
tokenization problems. The next two sections examine these approaches: first, we will see
what the core Java classes have to offer, and then we will demonstrate a number of the NLP
API tokenization libraries.
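As a preview of the core Java options, here is a brief sketch using three standard JDK facilities: String.split, java.util.StringTokenizer, and java.text.BreakIterator. Which classes the following sections actually cover is an assumption here; the class name CoreTokenizers, the delimiter set, and the sample sentence are hypothetical. Note that the JDK documentation describes StringTokenizer as a legacy class whose use is discouraged in new code.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;

// Hypothetical illustration of tokenizing with core JDK classes only.
public class CoreTokenizers {

    // String.split with a regular expression: break on runs of whitespace.
    public static List<String> splitOnWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return tokens;
        }
        for (String token : trimmed.split("\\s+")) {
            tokens.add(token);
        }
        return tokens;
    }

    // StringTokenizer: every character in the second argument is a delimiter.
    public static List<String> withStringTokenizer(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, " \t\n\r\f.,;:!?");
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // BreakIterator: walk locale-aware word boundaries and keep only
    // the segments that contain at least one letter or digit.
    public static List<String> withBreakIterator(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Let's pause, and then reflect.";
        System.out.println(splitOnWhitespace(text));   // punctuation stays attached
        System.out.println(withStringTokenizer(text)); // listed punctuation is dropped
        System.out.println(withBreakIterator(text));
    }
}
```

The three methods make different decisions about punctuation: the whitespace split keeps it attached to the neighboring word, while the other two discard it. Differences of this kind are what the downstream task ultimately sees.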