Finding Parts of Text - Natural Language Processing with Java

Understanding the parts of text

There are several ways to categorize the parts of a text. For example, we may be concerned with character-level issues such as punctuation, with a possible need to ignore or expand contractions. At the word level, we may need to perform operations such as:

• Identifying morphemes using stemming and/or lemmatization

• Expanding abbreviations and acronyms

• Isolating number units
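
As a rough illustration of two of the word-level operations listed above, the following sketch expands abbreviations from a small lookup table and isolates numeric tokens with a regular expression. The class name WordLevelSketch, its method names, and the table entries are assumptions made for this example and do not come from any NLP library. Reading "number units" as plain numeric tokens is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordLevelSketch {
    // Hypothetical lookup table; a real system would rely on a far larger resource.
    static final Map<String, String> ABBREVIATIONS = Map.of(
            "Dr.", "Doctor",
            "approx.", "approximately",
            "NLP", "natural language processing");

    // Whole numbers or decimals such as "25" or "3.5".
    static final Pattern NUMBER = Pattern.compile("\\b\\d+(?:\\.\\d+)?\\b");

    // Replaces each space-separated token that appears in the table.
    static String expandAbbreviations(String text) {
        StringBuilder out = new StringBuilder();
        for (String token : text.split(" ")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(ABBREVIATIONS.getOrDefault(token, token));
        }
        return out.toString();
    }

    // Collects every numeric token found in the text, in order of appearance.
    static List<String> isolateNumbers(String text) {
        List<String> numbers = new ArrayList<>();
        Matcher m = NUMBER.matcher(text);
        while (m.find()) {
            numbers.add(m.group());
        }
        return numbers;
    }

    public static void main(String[] args) {
        System.out.println(expandAbbreviations("Dr. Smith studies NLP"));
        System.out.println(isolateNumbers("The trip is 25 km and takes 3.5 hours."));
    }
}
```

A table lookup like this only matches tokens exactly as written, so it would miss variants such as "Dr" without the period. Stemming and lemmatization are left out here because they need language-specific rules or dictionaries.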

We cannot always split words at punctuation because punctuation is sometimes part of the word, as in "can't". We may also want to group multiple words into meaningful phrases. Sentence detection can also be a factor, since we do not necessarily want to group words that cross sentence boundaries.
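
The two concerns above can be sketched with the standard Java library alone. The example below contrasts a naive split on every non-alphanumeric character, which breaks "can't" apart, with a pattern that keeps internal apostrophes, and it uses java.text.BreakIterator to find sentence boundaries so that tokens are grouped per sentence. The class name TokenSketch and its method names are hypothetical, and the word pattern is a simplifying assumption (it handles only the straight apostrophe), not a complete tokenizer.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TokenSketch {
    // Runs of letters or digits, optionally joined by internal straight apostrophes.
    static final Pattern WORD =
            Pattern.compile("[\\p{L}\\p{N}]+(?:'[\\p{L}\\p{N}]+)*");

    // Splits on anything that is not a letter or digit, so "can't" is broken in two.
    static List<String> naiveSplit(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Keeps contractions such as "can't" as a single token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Uses the JDK's sentence BreakIterator to find sentence boundaries.
    static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    // Tokenizes each sentence separately so no group of words crosses a boundary.
    static List<List<String>> tokenizeBySentence(String text) {
        List<List<String>> result = new ArrayList<>();
        for (String sentence : sentences(text)) {
            result.add(tokenize(sentence));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "We can't go. It is late.";
        System.out.println(naiveSplit(text));
        System.out.println(tokenize(text));
        System.out.println(tokenizeBySentence(text));
    }
}
```

Tokenizing one sentence at a time means that any later phrase grouping works on a single sentence's tokens and cannot join the last word of one sentence to the first word of the next.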

In this chapter, we are primarily concerned with the tokenization process and a few specialized techniques such as stemming. We will not attempt to show how they are used in other NLP tasks; those efforts are reserved for later chapters.
