Understanding the parts of text
There are several ways to categorize the parts of a text. At the character level, we may
be concerned with issues such as punctuation, with a possible need to ignore or expand
contractions. At the word level, we may need to perform operations such as:
• Identifying morphemes using stemming and/or lemmatization
• Expanding abbreviations and acronyms
• Isolating number units
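Two of the word-level operations listed above can be sketched in a few lines of plain Java. This is a minimal illustration, not a method taken from the book: the class name, the method names, and the tiny abbreviation table are all hypothetical, and "isolating number units" is read here simply as pulling numeric values out of the text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordLevelDemo {
    // Hypothetical lookup table; a real system would load a much larger one.
    static final Map<String, String> ABBREVIATIONS = Map.of(
            "Dr.", "Doctor",
            "NLP", "natural language processing");

    // Replaces each space-separated word that has an entry in the table.
    static String expandAbbreviations(String text) {
        StringBuilder out = new StringBuilder();
        for (String word : text.split(" ")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(ABBREVIATIONS.getOrDefault(word, word));
        }
        return out.toString();
    }

    // Collects integer and decimal values such as "40" or "12.5".
    static List<String> isolateNumbers(String text) {
        List<String> numbers = new ArrayList<>();
        Matcher m = Pattern.compile("\\d+(?:\\.\\d+)?").matcher(text);
        while (m.find()) {
            numbers.add(m.group());
        }
        return numbers;
    }
}
```

The exact-match lookup is deliberately naive: an abbreviation followed by a comma, for instance, would not be found, which is one reason tokenization normally comes first.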
We cannot always split words at punctuation because punctuation marks are sometimes
part of the word, as in "can't". We may also be concerned with grouping multiple words
to form meaningful phrases. Sentence detection can also be a factor: we do not
necessarily want to group words that cross sentence boundaries.
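The following sketch illustrates the two points above using only the Java standard library. It assumes simple English text with a straight ASCII apostrophe; the class and method names are hypothetical, and the regular expressions are one possible choice rather than a recommended tokenizer.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TextPartsDemo {
    // Splits on every run of non-letter characters, so "can't" is broken apart.
    static List<String> splitOnPunctuation(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Treats an apostrophe between letters as part of the word.
    static List<String> splitKeepingApostrophes(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\p{L}+(?:'\\p{L}+)*").matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Uses the JDK's BreakIterator to find sentence boundaries.
    static List<String> splitSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }
}
```

Calling `splitOnPunctuation("We can't stop.")` yields the fragments `can` and `t`, whereas `splitKeepingApostrophes` keeps `can't` whole. Running `splitSentences` first and then tokenizing each sentence separately is one way to avoid grouping words across sentence boundaries.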
In this chapter, we are primarily concerned with the tokenization process and a few
specialized techniques such as stemming. We will not attempt to show how they are used
in other NLP tasks; that is reserved for later chapters.