Figure 9.3 Words from the Brown corpus's news category with the highest corpus TF, DF, or IDF
Words with higher IDF tend to be more meaningful over the entire corpus. In
other words, the IDF of a rare term is high, and the IDF of a frequent term is
low. For example, if a corpus contains 1,000 documents, all 1,000 of them might
contain the word "the", and 10 of them might contain the word "bPhone". With
IDF defined as log(N/DF), where N is the total number of documents and DF is
the number of documents containing the term, the IDF of "the" would be
log(1000/1000) = 0, and the IDF of "bPhone" would be log(1000/10) = log 100,
which is greater than the IDF of "the". If a corpus consists mostly of phone
reviews, the word "phone" would probably have high TF and DF but low IDF.
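The arithmetic in the example above can be checked with a few lines of Python. This is a minimal sketch, assuming the common definition IDF = log(N/DF) with a natural logarithm; the base of the logarithm and the helper name idf are assumptions made for illustration, and a different base changes the values but not their ordering.

```python
import math


def idf(total_docs: int, doc_freq: int) -> float:
    """Inverse document frequency as log(N / DF), using the natural log.

    The log base is an assumption; any base preserves the ranking of terms.
    """
    if doc_freq <= 0 or doc_freq > total_docs:
        raise ValueError("doc_freq must be between 1 and total_docs")
    return math.log(total_docs / doc_freq)


N = 1000  # documents in the hypothetical corpus from the text
doc_freqs = {"the": 1000, "bPhone": 10}  # DF values from the example

idf_scores = {term: idf(N, df) for term, df in doc_freqs.items()}
for term, score in idf_scores.items():
    print(f"{term}: DF={doc_freqs[term]}, IDF={score:.3f}")
# the: DF=1000, IDF=0.000
# bPhone: DF=10, IDF=4.605
```

A term that appears in every document gets an IDF of exactly 0, while the rarer term gets log 100, which is about 4.605 with the natural log (or 2 with base 10).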
Although IDF favors words that are more meaningful, it comes with a caveat.
Because the total document count of a corpus (N) remains constant, IDF depends
solely on DF. All words with the same DF value therefore receive the same IDF
value. IDF scores words higher the less frequently they occur across the
documents, and the words with the lowest DF all receive the same, highest IDF.
In the Brown corpus, for instance, words that appeared in an equal number of
documents received the same IDF values. In many cases, it is useful to distinguish between two