Why is NLP so hard?
NLP is not easy. Several factors make this process hard. For example, there are hundreds
of natural languages, each with its own syntax rules. Words can be ambiguous, with their
meaning depending on context. Here, we will examine a few of the more significant problem
areas.
At the character level, several factors need to be considered. One is the encoding scheme
used for a document: text can be encoded using schemes such as ASCII, UTF-8, UTF-16, or
Latin-1. We may also need to decide whether the text should be treated as case-sensitive.
Punctuation and numbers may require special processing. We sometimes need to consider the
use of emoticons (character combinations and special character images), hyperlinks,
repeated punctuation (… or ---), file extensions, and usernames with embedded periods.
Many of these are handled by preprocessing text, as we will discuss in Preparing data
later in the chapter.
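A few of these character-level concerns can be sketched in plain Java. The example below is
only an illustration under stated assumptions: the class name TextNormalizer and its methods
are hypothetical, the input is assumed to be UTF-8, and collapsing repeated punctuation is
just one possible policy.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class TextNormalizer {
    // Decode raw bytes with an explicitly chosen encoding (UTF-8 is an assumption here).
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Treat text as case-insensitive, collapse runs of '.' or '-' to one character,
    // and squeeze runs of whitespace into a single space.
    static String normalize(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        String collapsed = lower.replaceAll("([.\\-])\\1+", "$1");
        return collapsed.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        byte[] raw = "Wait...  What --- NOW?".getBytes(StandardCharsets.UTF_8);
        System.out.println(normalize(decodeUtf8(raw)));
    }
}
```

Whether to lowercase or collapse punctuation depends on the task. Repeated punctuation, for
instance, can carry emphasis that some analyses may want to keep.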
When we tokenize text, we usually break it up into a sequence of words. These words are
called tokens, and the process is referred to as tokenization. When a language uses
whitespace characters to delineate words, this process is not too difficult. With a
language like Chinese, it can be quite difficult, since it uses unique symbols for words
and does not separate them with whitespace.
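To make the whitespace case concrete, here is a minimal sketch of a whitespace tokenizer.
The class name WhitespaceTokenizer is hypothetical, and this is a simplification rather than
a recommended approach: punctuation stays attached to words, and text without spaces comes
back as a single token.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class WhitespaceTokenizer {
    // Split on runs of whitespace; an empty or blank input yields no tokens.
    static List<String> tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        // English words are separated by spaces, so four tokens are produced.
        System.out.println(tokenize("The quick  brown fox"));

        // Four Chinese characters meaning "natural language", written as Unicode
        // escapes. There is no whitespace, so the whole string is one token.
        System.out.println(tokenize("\u81EA\u7136\u8BED\u8A00").size());
    }
}
```

The second call shows why whitespace splitting is not enough for every language: segmenting
such text into words needs a different technique.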
Words and morphemes may need to be assigned a part-of-speech label identifying what type
of unit they are. A morpheme is the smallest division of text that has meaning. Prefixes
and suffixes are examples of morphemes. Often, we need to consider synonyms,
abbreviations, acronyms, and spelling variations when we work with words.
Stemming is another task that may need to be applied. Stemming is the process of finding
the word stem of a word. For example, words such as "walking", "walked", or "walks"
have the word stem "walk". Search engines often use stemming to assist in processing a query.
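The "walk" example can be reproduced with a deliberately naive suffix-stripping sketch. This
is not an established stemming algorithm: the class name NaiveStemmer, the suffix list, and
the minimum stem length of three characters are all assumptions made for illustration, and
real stemmers apply many more rules.

```java
import java.util.Locale;

class NaiveStemmer {
    // Suffixes are tried in order; this short list is illustrative only.
    private static final String[] SUFFIXES = {"ing", "ed", "s"};

    // Strip the first matching suffix, provided at least three characters remain.
    static String stem(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            int stemLength = lower.length() - suffix.length();
            if (lower.endsWith(suffix) && stemLength >= 3) {
                return lower.substring(0, stemLength);
            }
        }
        return lower;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"walking", "walked", "walks", "walk"}) {
            System.out.println(word + " -> " + stem(word));
        }
    }
}
```

All four inputs map to "walk" here, but a rule set this small breaks down quickly on
irregular forms, which is one reason practical stemmers are more elaborate.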
Closely related to stemming is the process of lemmatization. This process determines the
base form of a word, called its lemma. For example, for the word "operating", its stem is
"oper" but its lemma is "operate". Lemmatization is a more refined process than stemming
and uses vocabulary and morphological techniques to find a lemma. This can result in more
precise analysis in some situations.
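The vocabulary side of lemmatization can be sketched as a simple dictionary lookup. The
class name DictionaryLemmatizer and its three-entry dictionary are hypothetical. An actual
lemmatizer would rely on a full vocabulary together with morphological analysis, which this
sketch does not attempt.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class DictionaryLemmatizer {
    // A toy vocabulary mapping inflected forms to lemmas (illustrative entries only).
    private static final Map<String, String> LEMMAS = new HashMap<>();
    static {
        LEMMAS.put("operating", "operate");
        LEMMAS.put("operated", "operate");
        LEMMAS.put("ran", "run");
    }

    // Return the known lemma, or the lowercased word itself when it is not listed.
    static String lemma(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        return LEMMAS.getOrDefault(lower, lower);
    }

    public static void main(String[] args) {
        System.out.println(lemma("Operating"));
        System.out.println(lemma("ran"));
    }
}
```

Unlike suffix stripping, a lookup like this can handle an irregular form such as "ran", but
only for words that appear in the vocabulary.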