Text format: Text is often stored or presented in different formats. Plain text
is processed differently from HTML or other markup, and this complicates the
tokenization process.
Stopwords: Commonly used words might not be important for some NLP tasks
such as general searches. These common words are called stopwords. Stopwords
are sometimes removed when they do not contribute to the NLP task at hand.
These can include words such as "a", "and", and "she".
Text expansion: For acronyms and abbreviations, it is sometimes desirable to
expand them so that later processing steps can produce better quality results.
For example, if a search is interested in the word "machine", then knowing that
IBM stands for International Business Machines can be useful.
Case: The case of a word (upper or lower) may be significant in some situations.
For example, the case of a word can help identify proper nouns. When identifying
the parts of text, conversion to the same case can be useful in simplifying
searches.
Stemming and lemmatization: These processes alter words to reduce them to
their "roots".
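Several of the factors above can be combined in a few lines of Java. The following is only a minimal sketch: the class name NormalizationSketch, the three-word stopword set, and the one-entry expansion table are assumptions made for illustration, and tokens are split on whitespace alone. Stemming and lemmatization are not attempted here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class NormalizationSketch {
    // Hypothetical, tiny stopword list used only for illustration.
    static final Set<String> STOPWORDS = Set.of("a", "and", "she");

    // Hypothetical expansion table; a real one would be domain specific.
    static final Map<String, String> EXPANSIONS =
            Map.of("IBM", "International Business Machines");

    static List<String> normalize(String text) {
        List<String> result = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) {
                continue; // leading whitespace produces an empty first token
            }
            // Expand before case folding, because the upper case form
            // is what identifies the acronym.
            String expanded = EXPANSIONS.getOrDefault(token, token);
            for (String word : expanded.split("\\s+")) {
                String lower = word.toLowerCase(Locale.ROOT);
                if (!STOPWORDS.contains(lower)) {
                    result.add(lower);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [bought, share, of, international, business, machines]
        System.out.println(normalize("She bought a share of IBM"));
    }
}
```

The order of the steps matters in this sketch: folding case first would turn "IBM" into "ibm" and the lookup in the expansion table would miss.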
Removing stopwords can save space in an index and make the indexing process faster.
However, some search engines do not remove stopwords because they can be useful for
certain queries. For example, when performing an exact match, removing stopwords will
result in misses. Named entity recognition (NER) also often depends on stopwords being
present: recognizing that "Romeo and Juliet" is a play depends on the inclusion of the
word "and".
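The exact-match problem can be demonstrated with a short Java sketch. The class StopwordMatchSketch, its stopword set, and the sample sentence are hypothetical and chosen only for illustration. The "index" here is simply a list of lowercased tokens, which is far simpler than what a real search engine stores.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

public class StopwordMatchSketch {
    // Hypothetical stopword list used only for illustration.
    static final Set<String> STOPWORDS = Set.of("a", "and", "the");

    // Lowercase, split on whitespace, and optionally drop stopwords.
    static List<String> tokens(String text, boolean removeStopwords) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("\\s+"))
                .filter(t -> !t.isEmpty())
                .filter(t -> !removeStopwords || !STOPWORDS.contains(t))
                .collect(Collectors.toList());
    }

    // True when the phrase tokens occur contiguously in the indexed tokens.
    static boolean containsPhrase(List<String> indexed, List<String> phrase) {
        return Collections.indexOfSubList(indexed, phrase) >= 0;
    }

    public static void main(String[] args) {
        String document = "We saw Romeo and Juliet last night";
        List<String> query = tokens("Romeo and Juliet", false);

        // Stopwords kept in the index: the exact phrase is found (true).
        System.out.println(containsPhrase(tokens(document, false), query));
        // Stopwords removed from the index: the exact phrase is missed (false).
        System.out.println(containsPhrase(tokens(document, true), query));
    }
}
```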
Note
Many lists define stopwords, and what constitutes a stopword sometimes depends on the
problem domain. A list of stopwords can be found at http://www.ranks.nl/stopwords. It
lists a few categories of English stopwords and stopwords for languages other than
English. At http://www.textfixer.com/resources/common-english-words.txt, you will find
a comma-separated list of English stopwords.
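A comma-separated list of this kind is easy to load into a set. The sketch below is an assumption-laden illustration: the class StopwordListSketch and the sample string are hypothetical, and the string stands in for the contents of a downloaded file rather than reproducing any of the lists mentioned above.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

public class StopwordListSketch {
    // Parses text such as "a,able,about" into a sorted set of lowercase words.
    static Set<String> parse(String commaSeparated) {
        Set<String> words = new TreeSet<>();
        for (String word : commaSeparated.split(",")) {
            String trimmed = word.trim().toLowerCase(Locale.ROOT);
            if (!trimmed.isEmpty()) { // skip empty entries such as ",,"
                words.add(trimmed);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Hypothetical sample input, not taken from any published list.
        System.out.println(parse("a, and ,The,,she")); // prints [a, and, she, the]
    }
}
```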
The top ten stopwords, adapted from Stanford
(http://library.stanford.edu/blogs/digital-library-blog/2011/12/stopwords-searchworks-be-or-not-be),
are listed in the following table:
Stopword    Occurrences
the         7,578