Java Reference
In-Depth Information
Finding parts of text
Text can be decomposed into a number of different types of elements such as words,
sentences, and paragraphs. There are several ways of classifying these elements. When we
refer to parts of text in this topic, we are referring to words, sometimes called tokens.
Morphology is the study of the structure of words. We will use a number of morphology terms
in our exploration of NLP. There are many ways of classifying words, including
the following:
Simple words : These are the common connotations of what a word means including
the 17 words of this sentence.
Morphemes : These are the smallest meaningful units of a word. For example,
in the word "bounded", "bound" is a morpheme. Morphemes also include parts
such as the suffix "ed".
Prefix/Suffix : A prefix precedes the root of a word and a suffix follows it. For
example, in the word "graduation", "ation" is a suffix added to the root of the
word "graduate".
Synonyms : These are words that have the same meaning as another word. Words such
as small and tiny can be recognized as synonyms. Deciding whether two words are
synonymous in a given context requires word sense disambiguation.
Abbreviations : These are shortened forms of a word. Instead of writing Mister Smith,
we write Mr. Smith.
Acronyms : These are used extensively in many fields including computer science.
They combine letters from the words of a phrase, such as FORTRAN from FORmula
TRANslation. They can be recursive, such as GNU (GNU's Not Unix). Of course, the one
we will continue to use is NLP.
Contractions : We'll find these useful for commonly used combinations of words
such as the first word of this sentence.
Numbers : These are specialized words that normally use only digits. However, more
complex versions can include a period and a special character to reflect scientific
notation or numbers of a specific base.
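Several of the categories above can be recognized from the surface form of a token alone. The following is a minimal sketch, not a complete classifier: the class name TokenClassifier, the Kind enum, the regular expressions, and the short abbreviation list are all assumptions made for illustration. Morphemes and synonyms are deliberately left out because they need linguistic knowledge that pattern matching cannot supply.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

public class TokenClassifier {
    public enum Kind { NUMBER, CONTRACTION, ABBREVIATION, ACRONYM, WORD }

    // Hypothetical lookup list; a real system would use a much larger one.
    private static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("Mr.", "Mrs.", "Dr."));

    // Hexadecimal literals, or decimal digits with an optional fraction and exponent.
    private static final Pattern NUMBER = Pattern.compile(
            "0[xX][0-9a-fA-F]+|[+-]?\\d+(\\.\\d+)?([eE][+-]?\\d+)?");

    // Letters, a single apostrophe, then more letters, as in "We'll".
    private static final Pattern CONTRACTION = Pattern.compile("[A-Za-z]+'[A-Za-z]+");

    // Two or more capital letters, as in "NLP" (a crude assumption).
    private static final Pattern ACRONYM = Pattern.compile("[A-Z]{2,}");

    public static Kind classify(String token) {
        if (ABBREVIATIONS.contains(token)) {
            return Kind.ABBREVIATION;
        }
        if (NUMBER.matcher(token).matches()) {
            return Kind.NUMBER;
        }
        if (CONTRACTION.matcher(token).matches()) {
            return Kind.CONTRACTION;
        }
        if (ACRONYM.matcher(token).matches()) {
            return Kind.ACRONYM;
        }
        return Kind.WORD; // everything else is treated as a simple word
    }

    public static void main(String[] args) {
        for (String token : new String[] {"We'll", "Mr.", "NLP", "6.02e23", "0x1F", "tiny"}) {
            System.out.println(token + " -> " + classify(token));
        }
    }
}
```

The order of the checks matters: the abbreviation lookup runs first so that a trailing period is not mistaken for something else, and anything that matches no pattern falls through to WORD.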
Identifying these parts is useful for other NLP tasks. For example, to determine the
boundaries of a sentence, it is necessary to break the text apart and determine which
elements terminate a sentence.
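To illustrate why knowing the kind of each element matters for sentence boundaries, here is a naive sketch that splits text on whitespace and ends a sentence at any element that finishes with a period, question mark, or exclamation mark, unless that element is a known abbreviation. The class name SentenceSplitter and the three-entry abbreviation list are assumptions for illustration only; real sentence detectors handle many more cases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SentenceSplitter {
    // Hypothetical list: a period after one of these does not end a sentence.
    private static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("Mr.", "Mrs.", "Dr."));

    public static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // only happens when the input is blank
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(token);
            boolean terminator = token.endsWith(".")
                    || token.endsWith("!")
                    || token.endsWith("?");
            if (terminator && !ABBREVIATIONS.contains(token)) {
                sentences.add(current.toString());
                current.setLength(0); // start collecting the next sentence
            }
        }
        if (current.length() > 0) {
            sentences.add(current.toString()); // trailing text with no terminator
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String sentence : split("Mr. Smith arrived. Did he stay? Yes!")) {
            System.out.println(sentence);
        }
    }
}
```

With the sample input, the period in "Mr." is skipped because of the abbreviation lookup, so the text is divided into three sentences instead of four.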
The process of breaking text apart is called tokenization. The result is a stream of tokens.
The elements of the text that determine where it should be split are called delimiters.
For most English text, whitespace is used as a delimiter. This type of delimiter
typically includes blanks, tabs, and newline characters.
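As a minimal sketch of whitespace tokenization, the standard String.split method can be used with the regular expression \s+, which matches runs of blanks, tabs, and newline characters. The class name WhitespaceTokenizer and the sample sentence are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class WhitespaceTokenizer {
    // Splits text on runs of whitespace (blanks, tabs, newlines).
    public static List<String> tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList(); // avoid returning a single empty token
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        String text = "Mr. Smith went\tto the\nstore.";
        System.out.println(tokenize(text));
        // prints [Mr., Smith, went, to, the, store.]
    }
}
```

Note that punctuation stays attached to the neighboring word ("store."), which is one reason whitespace alone is often not a sufficient delimiter.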