7.3.1 Pre-Processing
Most of the clustering algorithms discussed in this chapter use the standard
vector space model for text, where a text document is represented as a sparse
high-dimensional vector of weighted term counts (41). The creation of the
vector space model can be divided into two stages. In the first stage, the
content-bearing terms (typically words or short phrases) are extracted from
the document text, and the weight of each term in the document vector is set
to the count of the corresponding term in the document. In the second stage,
the terms are suitably weighted according to information retrieval principles
to increase the weights of important terms.
Some terms in a document do not carry any important content; common
words such as “the” and “is” are called stop-words. While processing a
document in the first stage to count the occurrences of each term and build
the term-count vector, these stop-words are usually filtered out and not
included in the vector. Note that this vector is often more than 99% sparse,
since its dimensionality equals the number of distinct terms in the whole
document collection and most documents contain only a small subset of these
terms.
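As a concrete illustration of this first stage, the sketch below tokenizes a document, filters out stop-words, and stores the counts sparsely as a term-to-count dictionary; the function name and the (deliberately tiny) stop-word set are our own illustrative choices, not part of any cited system.

```python
from collections import Counter
import re

# A tiny illustrative stop-word list; real systems use much longer
# lists (several hundred words).
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "to", "in", "on"}

def term_counts(text):
    """Stage one: tokenize, drop stop-words, and count term occurrences.

    The result is kept sparse: a dict mapping only the terms that
    actually occur in the document to their counts, rather than a
    dense vector over the entire collection vocabulary.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

counts = term_counts("The cat sat on the mat, and the cat slept.")
# Counter({'cat': 2, 'sat': 1, 'mat': 1, 'slept': 1})
```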
In the second stage, the term count of each term is multiplied by the inverse
document frequency of that term in the document collection. This is done so
that terms common to most documents in a collection (e.g., “god” is a
common term in a collection of articles posted to newsgroups like
alt.atheism or soc.religion.christian) are given less weight, since they
are not very content-bearing in the context of the collection. This method
of term weighting, called “Term Frequency and Inverse Document Frequency”
(TFIDF), is a popular pre-processing method in the information retrieval
community (1).
The TFIDF weighting procedure we use is as follows. If $f_{ij}$ is the
frequency of the $i$th term in the $j$th document, then the corresponding
term frequency (TF) $\mathit{tf}_{ij}$ is simply $f_{ij}$ (sometimes
normalized across the entire document corpus):
$$\mathit{tf}_{ij} = f_{ij}$$
The inverse document frequency (IDF) $\mathit{idf}_i$ of the $i$th term is
defined as
$$\mathit{idf}_i = \log_2(N/\mathit{df}_i),$$
where $N$ is the total number of documents in the corpus and $\mathit{df}_i$
is the total number of documents containing the $i$th term. The overall
TFIDF score $w_{ij}$ of the $i$th term in the $j$th document is therefore:
$$w_{ij} = \mathit{tf}_{ij} \cdot \mathit{idf}_i = f_{ij} \log_2(N/\mathit{df}_i)$$
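These formulas translate directly into code. The sketch below assumes the per-document term-count dictionaries from the previous example; the function name tfidf is illustrative.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute w_ij = tf_ij * idf_i = f_ij * log2(N / df_i).

    `docs` is a list of per-document term-count dicts (as produced by
    term_counts above); the result is a parallel list of
    term -> TFIDF-weight dicts.
    """
    N = len(docs)
    df = Counter()  # df_i: number of documents containing term i
    for counts in docs:
        df.update(counts.keys())
    return [
        {term: f * math.log2(N / df[term]) for term, f in counts.items()}
        for counts in docs
    ]
```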
After TFIDF processing, terms that have a very low frequency (occurring in
fewer than 5 documents) or a very high frequency (occurring in more than 95%
of the documents) are sometimes removed (19) as a further pre-processing
step.
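This pruning step amounts to a document-frequency filter. A minimal sketch, using the thresholds quoted above as illustrative defaults (the exact cut-offs vary between studies):

```python
from collections import Counter

def prune_by_document_frequency(docs, min_df=5, max_df_fraction=0.95):
    """Drop terms that occur in fewer than `min_df` documents or in
    more than `max_df_fraction` of all documents in the collection.
    """
    N = len(docs)
    df = Counter()
    for weights in docs:
        df.update(weights.keys())
    keep = {t for t, d in df.items() if min_df <= d <= max_df_fraction * N}
    return [{t: w for t, w in weights.items() if t in keep}
            for weights in docs]
```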