Summary
In this chapter, we illustrated various approaches to tokenizing and normalizing
text. We started with simple tokenization techniques based on core Java classes,
such as the String class's split method and the StringTokenizer class. These
approaches can be useful when we decide to forgo the use of NLP API classes.
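As a reminder of what the core Java approaches look like, here is a minimal sketch. The class name SimpleTokenizers, its method names, and the sample sentence are assumptions made for illustration; only String.split and StringTokenizer come from the chapter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper class; the name and methods are illustrative only.
final class SimpleTokenizers {

    // Split on runs of whitespace using String.split with a regular expression.
    static List<String> splitOnWhitespace(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Tokenize with StringTokenizer; every character in delimiters is a separator.
    static List<String> tokenize(String text, String delimiters) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text, delimiters);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Let's pause, and then reflect.";
        System.out.println(splitOnWhitespace(text)); // punctuation stays attached
        System.out.println(tokenize(text, " ,."));   // punctuation is dropped
    }
}
```

Splitting on whitespace alone leaves punctuation attached to the neighboring word, while passing punctuation characters as StringTokenizer delimiters discards them. Neither handles contractions or abbreviations specially, which is one reason to reach for an NLP API.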
We demonstrated how tokenization can be performed using the OpenNLP, Stanford, and
LingPipe APIs. These APIs differ in how they perform tokenization and in the
options they support. A brief comparison of their outputs was provided.
We also discussed normalization, which can involve converting characters to
lowercase, expanding abbreviations, removing stopwords, stemming, and lemmatization.
We illustrated how these techniques can be applied using both core Java classes and
the NLP APIs.
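Two of these normalization steps, lowercasing and stopword removal, can be sketched with core Java alone. The class name SimpleNormalizer and the tiny stopword list are assumptions made for illustration, not taken from the chapter. Stemming and lemmatization are left out because they usually rely on an NLP library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class; the name and stopword list are illustrative only.
final class SimpleNormalizer {

    // A deliberately small stopword list (an assumption for this sketch).
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to"));

    // Lowercase each token, then drop it if it is a stopword.
    static List<String> normalize(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            // Locale.ROOT keeps lowercasing independent of the default locale.
            String lower = token.toLowerCase(Locale.ROOT);
            if (!STOPWORDS.contains(lower)) {
                result.add(lower);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("The", "End", "of", "THE", "Road");
        System.out.println(normalize(tokens));
    }
}
```

Lowercasing before the stopword lookup means that "The" and "THE" are both matched by the single entry "the".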
In the next chapter, we will investigate the issues involved with determining the
end of sentences using various NLP APIs.