Finding Parts of Text - Natural Language Processing with Java


Summary

In this chapter, we illustrated various approaches to tokenizing and normalizing text. We started with simple tokenization techniques based on core Java classes, such as the String class's split method and the StringTokenizer class. These approaches can be useful when we decide to forgo the use of NLP API classes.
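
As a reminder of what the core Java approaches look like, here is a minimal sketch. The class name, method names, delimiter set, and sample sentence are assumptions made for illustration, not code taken from the chapter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

class CoreJavaTokenizers {

    // Split on runs of whitespace using String.split with a regular expression.
    // Punctuation stays attached to the neighboring word.
    static List<String> splitTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Tokenize with StringTokenizer. Each character in the delimiter string
    // is treated as a separator, so periods and commas are dropped here.
    static List<String> stringTokenizerTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text, " \t\n\r\f.,");
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Mr. Smith went to 123 Washington avenue.";
        System.out.println(splitTokens(text));
        System.out.println(stringTokenizerTokens(text));
    }
}
```

With whitespace splitting, tokens such as "Mr." keep their trailing period, while the StringTokenizer variant strips it because the period is listed as a delimiter. Neither approach understands abbreviations or sentence boundaries, which is one reason to reach for an NLP API.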

We demonstrated how tokenization can be performed using the OpenNLP, Stanford, and

LingPipe APIs. We found there are variations in how tokenization can be performed and in

options that can be applied in these APIs. A brief comparison of their outputs was

provided.

We discussed normalization, which can involve converting characters to lowercase, expanding abbreviations, removing stopwords, stemming, and lemmatization. We illustrated how these techniques can be applied using both core Java classes and the NLP APIs.
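
Two of these normalization steps, lowercasing and stopword removal, can be sketched with core Java alone. The class name and the very short stopword list below are assumptions for illustration; a practical list would be much longer, and stemming or lemmatization would normally come from an NLP library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class SimpleNormalizer {

    // A tiny illustrative stopword list (an assumption, not the chapter's list).
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "to", "and", "in"));

    // Lowercase each token, then drop it if it is a stopword.
    static List<String> normalize(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!STOPWORDS.contains(lower)) {
                result.add(lower);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens =
                Arrays.asList("The", "Quick", "fox", "jumped", "to", "the", "River");
        System.out.println(normalize(tokens));
    }
}
```

Lowercasing before the stopword lookup means "The" and "the" are treated alike, so the stopword set only needs lowercase entries.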

In the next chapter, we will investigate the issues involved with determining the end of sentences using various NLP APIs.
