term frequency definition that is the most suitable to the data and the problem to
be solved.
A term frequency vector (shown in Table 9.5) can become very high dimensional
because the bag-of-words vector space can grow substantially to include all the
words in English. The high dimensionality makes it difficult to store and parse the
text and contributes to performance issues in text analysis.
To reduce dimensionality, not all the words from a given language need to be
included in the term frequency vector. In English, for example, it is common to
remove words such as the, a, of, and, to, and other articles and function words
that are not likely to contribute to semantic understanding. These common words
are called stop words. Lists of stop words are available in various languages for
automating the identification of stop words. Among them is Snowball's stop
word list [23], which contains stop words in more than ten languages.
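As a minimal sketch of stop word removal, the Python snippet below filters a
token list against a small hand-written stop list. The set STOP_WORDS, the
function remove_stop_words, and the sample sentence are assumptions made for
illustration; they are not taken from Snowball's list, which a real application
would load instead.

```python
# Illustrative stop list only (an assumption for this sketch); a real
# application would load a published list such as Snowball's.
STOP_WORDS = {"the", "a", "an", "of", "and", "to"}


def remove_stop_words(tokens):
    """Return the tokens that are not in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


# Hypothetical sample sentence, split on whitespace.
tokens = "the price of the phone and the case".split()
filtered = remove_stop_words(tokens)
print(filtered)  # ['price', 'phone', 'case']
```

Every dropped token is one dimension that no longer needs to be stored in the
term frequency vector.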
Another simple yet effective way to reduce dimensionality is to store a term and
its frequency only if the term appears at least once in a document. Any term
not existing in the term frequency vector by default will have a frequency of 0.
Therefore, the previous term frequency vector would be simplified to what is
shown in Table 9.6.
Table 9.6 A Simpler Form of the Term Frequency Vector
Term      Frequency
i         1
love      2
my        1
bphone    1
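The sparse representation in Table 9.6 can be sketched in Python with
collections.Counter, which stores only the terms that occur and reports 0 for
any other term. The function term_frequency and the sample text are assumptions
for illustration; the text was chosen so that its counts match the table.

```python
from collections import Counter


def term_frequency(text):
    """Count lowercase, whitespace-separated terms; absent terms are not stored."""
    return Counter(text.lower().split())


# Assumed sample text, chosen to reproduce the counts in Table 9.6.
tf = term_frequency("I love LOVE my bPhone")
print(dict(tf))     # {'i': 1, 'love': 2, 'my': 1, 'bphone': 1}
print(tf["apple"])  # 0, although 'apple' is never stored
```

Looking up a missing term does not add it to the Counter, so the stored vector
stays as small as the document's own vocabulary.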
Some NLP techniques, such as lemmatization and stemming, can also reduce
dimensionality. Lemmatization and stemming are two different techniques that
combine various forms of a word. With these techniques, words such as play,
plays, played, and playing can be mapped to the same term.
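To illustrate the idea only, the sketch below strips a few common English
suffixes so that the four forms of play collapse to one term. The function
naive_stem and its suffix list are hypothetical and deliberately crude; this is
not the Porter or Snowball stemming algorithm, and it performs no
lemmatization, which would rely on a vocabulary and part-of-speech information.

```python
# Toy suffix list (an assumption for this sketch), checked in this order.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    """Strip the first matching suffix if at least three characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


forms = ["play", "plays", "played", "playing"]
stems = [naive_stem(w) for w in forms]
print(stems)  # ['play', 'play', 'play', 'play']
```

After this mapping the four surface forms occupy a single dimension in the term
frequency vector instead of four. A production stemmer handles many cases that
this toy version gets wrong, such as irregular forms and doubled consonants.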
As shown earlier, term frequency is based on the raw count of a term
occurring in a stand-alone document. Term frequency by itself suffers from a
critical problem: it treats that stand-alone document as the entire world. The
importance of a term is based solely on its presence in this particular document.
Stop words such as the, and, and a could be inappropriately considered the