Finding parts of text
Text can be decomposed into a number of different types of elements such as words, sentences,
and paragraphs. There are several ways of classifying these elements. When we refer to parts
of text in this topic, we are referring to words, sometimes called tokens. Morphology is the
study of the structure of words. We will use a number of morphology terms in our exploration
of NLP. There are many ways of classifying words, including the following:
• Simple words: These are the common connotations of what a word means, including the 17
words of this sentence.
• Morphemes: These are the smallest meaningful units of a word. For example, in the word
"bounded", "bound" is considered to be a morpheme. Morphemes also include parts such as the
suffix "ed".
• Prefix/Suffix: These precede or follow the root of a word. For example, in the word
"graduation", "ation" is a suffix attached to the root "graduate".
• Synonyms: These are words that have the same meaning as another word. Words such as
"small" and "tiny" can be recognized as synonyms. Recognizing synonyms reliably requires
word sense disambiguation.
• Abbreviations: These shorten a word. Instead of writing Mister Smith, we write Mr. Smith.
• Acronyms: These are used extensively in many fields, including computer science. They
combine letters from a phrase, such as FORTRAN from FORmula TRANslation. They can be
recursive, such as GNU (GNU's Not Unix). Of course, the one we will continue to use is NLP.
• Contractions: We'll find these useful for commonly used combinations of words, such as the
first word of this sentence.
• Numbers: A number is a specialized word that normally uses only digits. However, more
complex forms can include a period or other special characters, for example to express
scientific notation or a number in a specific base.
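Several of the categories above can be approximated with regular expressions. The following is a minimal sketch, not a complete classifier: the class name TokenKinds, the method classify, and the four patterns are hypothetical and chosen only for illustration. Real text needs much broader rules. For instance, this sketch would label a possessive such as "Smith's" as a contraction.

```java
import java.util.regex.Pattern;

class TokenKinds {
    // Hypothetical patterns for illustration only.
    // Optional sign, digits, optional decimal part, optional exponent: 42, -3.14, 6.02e23
    private static final Pattern DECIMAL =
        Pattern.compile("[+-]?\\d+(\\.\\d+)?([eE][+-]?\\d+)?");
    // Java-style hexadecimal literal, as one example of a number in a specific base: 0x1F
    private static final Pattern HEX = Pattern.compile("0[xX][0-9a-fA-F]+");
    // Letters, an apostrophe, then more letters: We'll, don't
    private static final Pattern CONTRACTION = Pattern.compile("[A-Za-z]+'[A-Za-z]+");
    // Two or more capital letters: NLP, GNU
    private static final Pattern ACRONYM = Pattern.compile("[A-Z]{2,}");

    // Returns a rough category label for a single token.
    public static String classify(String token) {
        if (DECIMAL.matcher(token).matches() || HEX.matcher(token).matches()) {
            return "number";
        }
        if (CONTRACTION.matcher(token).matches()) {
            return "contraction";
        }
        if (ACRONYM.matcher(token).matches()) {
            return "acronym";
        }
        return "word";
    }

    public static void main(String[] args) {
        for (String t : new String[] {"We'll", "NLP", "6.02e23", "0x1F", "tiny"}) {
            System.out.println(t + " -> " + classify(t));
        }
    }
}
```

Categories such as synonyms and morphemes are left out on purpose, since they cannot be detected from the surface form of a token alone.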
Identifying these parts is useful for other NLP tasks. For example, to determine the boundaries
of a sentence, it is necessary to break the text apart and determine which elements terminate
a sentence.
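As a rough illustration of why the parts matter for sentence detection, the sketch below splits text on whitespace and collects every element that ends in a period, exclamation mark, or question mark. The class name SentenceEndCandidates and the method candidates are hypothetical, and the rule is deliberately naive.

```java
import java.util.ArrayList;
import java.util.List;

class SentenceEndCandidates {
    // Returns the whitespace-separated elements that end in '.', '!' or '?'.
    // These are only candidates: an abbreviation such as "Mr." also qualifies.
    public static List<String> candidates(String text) {
        List<String> result = new ArrayList<>();
        for (String element : text.trim().split("\\s+")) {
            if (element.endsWith(".") || element.endsWith("!") || element.endsWith("?")) {
                result.add(element);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [Mr., Smith., NLP.]
        System.out.println(candidates("We met Mr. Smith. He works on NLP."));
    }
}
```

The abbreviation "Mr." is reported alongside the two real sentence endings, which shows that a period alone does not identify a terminating element.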
The process of breaking text apart is called tokenization. The result is a stream of tokens.
The elements of the text that determine where it should be split are called delimiters. For
most English text, whitespace is used as a delimiter. This type of delimiter typically
includes blanks, tabs, and newline characters.
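Whitespace tokenization can be sketched with the standard String.split method and the regular expression \s+, which matches runs of blanks, tabs, and newline characters. The class name WhitespaceTokenizer and the method tokenize are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.List;

class WhitespaceTokenizer {
    // Splits text into tokens, using runs of whitespace as the delimiter.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            // Leading whitespace produces one empty string, which is skipped.
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [Mr., Smith, went, to, Washington.]
        System.out.println(tokenize("Mr. Smith went\tto\nWashington."));
    }
}
```

Punctuation stays attached to the neighboring word ("Washington."), so whitespace alone is usually only a first step.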