7.3.1 Pre-Processing
Most of the clustering algorithms discussed in this chapter use the standard
vector space model for text, where a text document is represented as a sparse
high-dimensional vector of weighted term counts (41). The creation of the
vector space model can be divided into two stages. In the first stage, the
content-bearing terms (typically words or short phrases) are extracted from
the document text, and the weight of each term in the document vector is set
to the count of the corresponding term in the document. In the second stage,
the terms are suitably weighted according to information retrieval principles
to increase the weights of important terms.
Some terms in a document do not carry any important content; common
words such as “the” and “is” are called stop-words. While processing a
document in the first stage to count the occurrences of each term and build
the term-count vector, these stop-words are usually filtered out and not
included in the vector. Note that this vector is often more than 99% sparse,
since its dimensionality equals the number of distinct terms in the whole
document collection and most documents contain only a small subset of these
terms.
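As a concrete illustration of this first stage, the sketch below tokenizes a document, filters out stop-words, and stores the counts sparsely as a term-to-count dictionary; the function name and the (deliberately tiny) stop-word set are our own illustrative choices, not part of any cited system.

```python
from collections import Counter
import re

# A tiny illustrative stop-word list; real systems use much longer
# lists (several hundred words).
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "to", "in", "on"}

def term_counts(text):
    """Stage one: tokenize, drop stop-words, and count term occurrences.

    The result is kept sparse: a dict mapping only the terms that
    actually occur in the document to their counts, rather than a
    dense vector over the entire collection vocabulary.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

counts = term_counts("The cat sat on the mat, and the cat slept.")
# Counter({'cat': 2, 'sat': 1, 'mat': 1, 'slept': 1})
```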
In the second stage, the term count of each term is multiplied by the inverse
document frequency of that term in the document collection. This is done so
that terms common to most documents in a collection (e.g., “god” is a
common term in a collection of articles posted to newsgroups like
alt.atheism or soc.religion.christian) are given less weight, since they
are not very content-bearing in the context of the collection. This method
of term weighting, called “Term Frequency and Inverse Document Frequency”
(TFIDF), is a popular pre-processing method in the information retrieval
community (1).
The TFIDF weighting procedure we use is as follows. If $f_{ij}$ is the
frequency of the $i$th term in the $j$th document, then the corresponding
term frequency (TF) $\mathit{tf}_{ij}$ is simply $f_{ij}$ (sometimes
normalized across the entire document corpus):
$$\mathit{tf}_{ij} = f_{ij}$$
The inverse document frequency (IDF) $\mathit{idf}_i$ of the $i$th term is
defined as
$$\mathit{idf}_i = \log_2(N/\mathit{df}_i),$$
where $N$ is the total number of documents in the corpus and $\mathit{df}_i$
is the total number of documents containing the $i$th term. The overall
TFIDF score $w_{ij}$ of the $i$th term in the $j$th document is therefore:
$$w_{ij} = \mathit{tf}_{ij} \cdot \mathit{idf}_i = f_{ij} \log_2(N/\mathit{df}_i)$$
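These formulas translate directly into code. The sketch below assumes the per-document term-count dictionaries from the previous example; the function name tfidf is illustrative.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute w_ij = tf_ij * idf_i = f_ij * log2(N / df_i).

    `docs` is a list of per-document term-count dicts (as produced by
    term_counts above); the result is a parallel list of
    term -> TFIDF-weight dicts.
    """
    N = len(docs)
    df = Counter()  # df_i: number of documents containing term i
    for counts in docs:
        df.update(counts.keys())
    return [
        {term: f * math.log2(N / df[term]) for term, f in counts.items()}
        for counts in docs
    ]
```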
After TFIDF processing, terms that have a very low frequency (occurring in
fewer than 5 documents) or a very high frequency (occurring in more than 95%
of the documents) are sometimes removed (19) as a further pre-processing
step.
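This pruning step amounts to a document-frequency filter. A minimal sketch, using the thresholds quoted above as illustrative defaults (the exact cut-offs vary between studies):

```python
from collections import Counter

def prune_by_document_frequency(docs, min_df=5, max_df_fraction=0.95):
    """Drop terms that occur in fewer than `min_df` documents or in
    more than `max_df_fraction` of all documents in the collection.
    """
    N = len(docs)
    df = Counter()
    for weights in docs:
        df.update(weights.keys())
    keep = {t for t, d in df.items() if min_df <= d <= max_df_fraction * N}
    return [{t: w for t, w in weights.items() if t in keep}
            for weights in docs]
```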