Term weighting schemes
In Chapter 3, Obtaining, Processing, and Preparing Data with Spark, we looked at a vector
representation in which text features are mapped to a simple binary vector, called the
bag-of-words model. Another representation commonly used in practice is term
frequency-inverse document frequency (TF-IDF).
TF-IDF weights each term in a piece of text (referred to as a document) based on its
frequency in the document (the term frequency). A global normalization, called the inverse
document frequency, is then applied based on the frequency of this term among all
documents (the set of documents in a dataset is commonly referred to as a corpus). The
standard definition of TF-IDF is shown here:
tf-idf(t,d) = tf(t,d) × idf(t)
Here, tf(t,d) is the frequency (number of occurrences) of term t in document d and idf(t) is
the inverse document frequency of term t in the corpus; this is defined as follows:
idf(t) = log(N / df(t))
Here, N is the total number of documents in the corpus, and df(t) is the number of
documents in which the term t occurs.
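To make the definition concrete, the following is a minimal sketch in plain Python (not Spark) that applies the formula to a tiny corpus. The function name tf_idf, the toy corpus, and the use of the natural logarithm are assumptions made for illustration; the definition above does not fix the base of the logarithm, and libraries often use smoothed variants.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: weight} dict per document in a tokenized corpus.

    Illustrative helper (hypothetical name); uses raw counts for the term
    frequency and the natural log for the inverse document frequency.
    """
    n_docs = len(corpus)
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))
    weights = []
    for doc in corpus:
        term_freq = Counter(doc)  # occurrences of each term in this document
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights

# Hypothetical toy corpus, already tokenized and lowercased.
corpus = [
    ["spark", "is", "fast"],
    ["spark", "is", "scalable"],
    ["hadoop", "is", "reliable"],
]
weights = tf_idf(corpus)
```

In this sketch, the term "is" occurs in every document, so its inverse document frequency is log(3 / 3) = 0 and its weight vanishes, while "fast" occurs in only one document and receives the largest weight in the first document.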
The TF-IDF formulation means that terms occurring many times in a document receive a
higher weighting in the vector representation relative to those that occur few times in the
document. However, the IDF normalization has the effect of reducing the weight of terms
that are very common across all documents. The end result is that terms that are frequent
in a document but rare across the corpus receive the highest weights, while terms that are
common everywhere (and are assumed to carry less information) contribute little.
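The effect of the IDF normalization can be seen with a small calculation. The sketch below is illustrative only: the corpus size, the three terms, and their counts are made-up numbers, and the natural logarithm is an assumed choice of base.

```python
import math

def idf(n_docs, doc_freq):
    """Inverse document frequency with the natural log (assumed base)."""
    return math.log(n_docs / doc_freq)

n_docs = 1000  # hypothetical corpus size

# (term, occurrences in one document, number of documents containing the term)
# All counts are invented for illustration.
terms = [("the", 20, 1000), ("data", 5, 300), ("spark", 3, 10)]

scores = {term: tf * idf(n_docs, df) for term, tf, df in terms}

# Terms ordered from highest to lowest TF-IDF weight.
ranked = sorted(scores, key=scores.get, reverse=True)
```

Although "the" occurs most often in the document, it appears in every document of the corpus, so its weight is zero. The term "spark" occurs only three times but is found in few documents, so it ends up with the highest weight of the three.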
Note
A good resource to learn more about the bag-of-words model (or vector space model) is
the book Introduction to Information Retrieval, Christopher D. Manning, Prabhakar
Raghavan and Hinrich Schütze, Cambridge University Press (available in HTML form at
http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html).
It contains sections on text processing techniques, including tokenization, stop word
removal, stemming, and the vector space model, as well as weighting schemes such as
TF-IDF.