$$\mathrm{IDF}(t) = \log \frac{N}{\mathrm{DF}(t)} \tag{9.5}$$
If a term does not appear in the corpus, its document frequency is zero and
Equation 9.5 divides by zero. A quick fix is to add 1 to the denominator, as
shown in Equation 9.6.
$$\mathrm{IDF}(t) = \log \frac{N}{\mathrm{DF}(t) + 1} \tag{9.6}$$
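The effect of the extra 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it takes Equation 9.5 to be log(N / DF(t)) with N the number of documents in the corpus, and the function names `idf_plain` and `idf_smoothed` are hypothetical, not taken from the text.

```python
import math

def idf_plain(n_docs, df):
    """Assumed form of Equation 9.5: log(N / DF). Fails when df == 0."""
    return math.log(n_docs / df)

def idf_smoothed(n_docs, df):
    """Assumed form of Equation 9.6: 1 is added to the denominator."""
    return math.log(n_docs / (df + 1))

# A term that never occurs in the corpus has df == 0.
try:
    idf_plain(100, 0)
    plain_failed = False
except ZeroDivisionError:
    plain_failed = True

unseen_idf = idf_smoothed(100, 0)  # log(100 / 1), finite
```

With the smoothed form, an unseen term receives the largest IDF any term can have in that corpus, which matches the intuition that it is maximally rare.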
The base of the logarithm does not affect how terms rank. Changing the base
multiplies every IDF value by the same positive constant, so the relative
order of terms is preserved.
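The constant factor follows from the change-of-base identity. As a short worked step, assuming the smoothed form of IDF with N documents and any base b > 1:

```latex
\log_b x = \frac{\ln x}{\ln b}
\quad\Longrightarrow\quad
\log_b \frac{N}{\mathrm{DF}(t) + 1}
  = \frac{1}{\ln b} \cdot \ln \frac{N}{\mathrm{DF}(t) + 1}
```

Because 1 / ln b is positive and identical for every term, two terms compare the same way under base 2, base 10, or the natural logarithm.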
Figure 9.3 shows 50 words with (a) the highest corpus-wide term frequencies (TF),
(b) the highest document frequencies (DF), and (c) the highest inverse document
frequencies (IDF) from the news category of the Brown Corpus. Stop words tend
to have high TF and DF because they occur frequently in most documents.
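The same pattern shows up even on a very small example. The sketch below uses a hypothetical three-document toy corpus rather than the Brown Corpus, splits on whitespace only, and assumes the IDF form with 1 added to the denominator; the variable names are illustrative.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for real news documents.
docs = [
    "the market fell as the bank raised rates",
    "the team won the final game",
    "the senate passed the budget bill",
]
tokenized = [doc.split() for doc in docs]

# Corpus-wide term frequency: total occurrences across all documents.
tf = Counter(word for tokens in tokenized for word in tokens)

# Document frequency: number of documents containing the term at least once.
df = Counter(word for tokens in tokenized for word in set(tokens))

# IDF with 1 added to the denominator (assumed form).
n_docs = len(docs)
idf = {word: math.log(n_docs / (df[word] + 1)) for word in df}

top_tf_word = tf.most_common(1)[0][0]
top_df_word = df.most_common(1)[0][0]
lowest_idf_word = min(idf, key=idf.get)
```

Here the stop word "the" has the highest TF and DF and the lowest IDF. Its IDF is negative because, with the added 1, the denominator exceeds the number of documents for a term that appears in all of them.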