Introduction to NLP - Natural Language Processing with Java

Why is NLP so hard?

NLP is not easy. Several factors make this process hard. For example, there are hundreds of natural languages, each of which has different syntax rules. Words can be ambiguous, with their meaning dependent on their context. Here, we will examine a few of the more significant problem areas.

At the character level, several factors need to be considered. One is the encoding scheme used for a document: text can be encoded using schemes such as ASCII, UTF-8, UTF-16, or Latin-1. Another is whether the text should be treated as case-sensitive. Punctuation and numbers may require special processing. We sometimes need to consider the use of emoticons (character combinations and special character images), hyperlinks, repeated punctuation (… or ---), file extensions, and usernames with embedded periods. Many of these are handled by preprocessing text, as we will discuss in Preparing data later in the chapter.
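
To make the character-level concerns concrete, the following is a minimal sketch using only the standard JDK. The class name CharacterLevelSketch and its methods are hypothetical names chosen for illustration, and the normalization rules (lowercasing, collapsing repeated punctuation) are one assumed policy among many, not a recommended pipeline.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper class; the names here are illustrative assumptions.
final class CharacterLevelSketch {

    // Decode raw bytes with an explicitly chosen charset rather than the platform default.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Treat text as case-insensitive and collapse runs of one repeated
    // punctuation character (for example "!!!" or "---") into a single character.
    static String normalize(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        // (\p{Punct})\1+ matches an ASCII punctuation character followed by
        // one or more copies of itself.
        return lower.replaceAll("(\\p{Punct})\\1+", "$1");
    }

    public static void main(String[] args) {
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        // Decoding with the charset that was used for encoding recovers the text.
        System.out.println(decode(raw, StandardCharsets.UTF_8));
        // Decoding the same bytes as Latin-1 yields a different, garbled string.
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1));
        System.out.println(normalize("Wait---WHAT!!!"));
    }
}
```

Decoding with the wrong charset does not throw an error here; it silently produces different characters, which is why the encoding of a document has to be known or detected before any further processing.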

When we tokenize text, it usually means we are breaking up the text into a sequence of words. These words are called tokens, and the process is referred to as tokenization. When a language uses whitespace characters to delineate words, this process is not too difficult. With a language like Chinese, it can be quite difficult, since it uses unique symbols for words and does not separate them with whitespace.
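
As a rough illustration of tokenization for whitespace-delimited text, the sketch below shows two approaches available in the standard JDK: splitting on whitespace, and using java.text.BreakIterator to find word boundaries. The class and method names are hypothetical, and the filtering rule (keep only spans that contain a letter or digit) is an assumption made for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class; the names here are illustrative assumptions.
final class TokenizerSketch {

    // Split on runs of whitespace. Punctuation stays attached to the
    // neighbouring word, so "fox." is returned as a single token.
    static List<String> whitespaceTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Walk the word boundaries reported by BreakIterator and keep only the
    // spans that contain at least one letter or digit, which drops spans
    // made up purely of whitespace or punctuation.
    static List<String> wordTokens(String text, Locale locale) {
        List<String> tokens = new ArrayList<>();
        BreakIterator boundaries = BreakIterator.getWordInstance(locale);
        boundaries.setText(text);
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            String span = text.substring(start, end);
            if (span.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(span);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(whitespaceTokens("The quick  brown fox."));
        System.out.println(wordTokens("Hello, world!", Locale.ENGLISH));
    }
}
```

Neither approach is likely to be adequate for text written without whitespace between words; such text generally calls for a dedicated segmenter, which is outside the scope of this sketch.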

Words and morphemes may need to be assigned a part-of-speech label identifying what type of unit each one is. A morpheme is the smallest division of text that has meaning. Prefixes and suffixes are examples of morphemes. Often, we need to consider synonyms, abbreviations, acronyms, and spellings when we work with words.

Stemming is another task that may need to be applied. Stemming is the process of finding the word stem of a word. For example, words such as "walking", "walked", or "walks" have the word stem "walk". Search engines often use stemming to assist in processing a query.
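
The idea of stemming can be sketched with a deliberately naive suffix stripper. This is not the Porter algorithm or any other published stemmer; the class name, the suffix list, and the minimum stem length of three characters are all assumptions made for illustration.

```java
import java.util.Locale;

// Hypothetical toy stemmer; real stemmers apply many more rules.
final class NaiveStemmerSketch {

    // Suffixes are tried in this order, longest first.
    private static final String[] SUFFIXES = {"ing", "ed", "s"};

    static String stem(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            int stemLength = lower.length() - suffix.length();
            // Require at least three remaining characters so that short
            // words such as "sing" or "red" are left unchanged.
            if (lower.endsWith(suffix) && stemLength >= 3) {
                return lower.substring(0, stemLength);
            }
        }
        return lower;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"walking", "walked", "walks"}) {
            System.out.println(word + " -> " + stem(word));
        }
    }
}
```

All three example words reduce to "walk" with these rules. The same rules also turn "glass" into "glas", which shows why practical stemmers need considerably more care than blind suffix removal.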

Closely related to stemming is the process of lemmatization. This process determines the base form of a word, called its lemma. For example, for the word "operating", its stem is "oper" but its lemma is "operate". Lemmatization is a more refined process than stemming and uses vocabulary and morphological techniques to find a lemma. This can result in more precise analysis in some situations.
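
The vocabulary-based side of lemmatization can be sketched as a dictionary lookup. The tiny hand-written table below is purely hypothetical and stands in for the much larger vocabulary and morphological analysis a real lemmatizer would use; the class and method names are likewise assumptions.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical toy lemmatizer backed by a hand-written lookup table.
final class LemmaLookupSketch {

    private static final Map<String, String> LEMMAS = new HashMap<>();

    static {
        // Illustrative entries only; a real vocabulary would be far larger.
        LEMMAS.put("operating", "operate");
        LEMMAS.put("operated", "operate");
        LEMMAS.put("walking", "walk");
        LEMMAS.put("walked", "walk");
        LEMMAS.put("went", "go"); // irregular form that suffix stripping cannot handle
    }

    // Return the known lemma, or the lowercased word itself when it is not in the table.
    static String lemma(String word) {
        String key = word.toLowerCase(Locale.ROOT);
        return LEMMAS.getOrDefault(key, key);
    }

    public static void main(String[] args) {
        System.out.println(lemma("operating"));
        System.out.println(lemma("went"));
    }
}
```

Unlike a stem, the value returned for a known word is always a real word, which is what makes the result more useful for precise analysis. The cost is that the lookup is only as good as the vocabulary behind it.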
