words that appear in an equal number of documents. Methods to further weight
words should be considered to refine the IDF score.
The TFIDF (or TF-IDF) is a measure that considers both the prevalence of a term
within a document (TF) and the scarcity of the term over the entire corpus (IDF).
The TFIDF of a term t in a document d is defined as the term frequency of t in d
multiplied by the inverse document frequency of t in the corpus, as shown in Equation 9.7:
\[ \mathrm{TFIDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t) \tag{9.7} \]
TFIDF assigns higher scores to words that appear more often in a document but
less often across all documents in the corpus. Note that TFIDF applies to a term in a
specific document, so the same term is likely to receive different TFIDF scores in
different documents (because the TF values may be different).
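As a minimal sketch of this behavior, the following Python snippet computes TFIDF scores for a tiny tokenized corpus. It assumes TF is the raw count of a term in a document and IDF is log(N / n_t), where N is the number of documents and n_t is the number of documents containing the term; other TF and IDF variants would give different numbers. The function name tfidf and the sample corpus are hypothetical and exist only for illustration.

```python
import math
from collections import Counter


def tfidf(corpus):
    """Return one {term: score} dict per document in a tokenized corpus."""
    n_docs = len(corpus)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)  # assumption: TF is the raw term count
        # assumption: IDF(t) = log(N / n_t)
        scores.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return scores


# Hypothetical three-document corpus, already split into lowercase tokens.
corpus = [
    ["the", "phone", "is", "a", "great", "phone"],
    ["the", "battery", "of", "the", "phone", "is", "weak"],
    ["the", "screen", "is", "great"],
]
scores = tfidf(corpus)

for i, doc_scores in enumerate(scores):
    print(i, sorted(doc_scores.items(), key=lambda kv: -kv[1]))
```

Under these assumptions, "the" scores zero everywhere because it occurs in every document, while "phone" scores higher in the first document (where it occurs twice) than in the second (where it occurs once), even though its IDF is the same in both.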
TFIDF is efficient in that the calculations are simple and straightforward, and
it does not require knowledge of the underlying meanings of the text. But this
approach also reveals little of the inter-document or intra-document statistical
structure. The next section shows how topic models can address this shortcoming
of TFIDF.