Figure 9.3 Words from Brown corpus's news category with the highest corpus
TF, DF, or IDF
Words with higher IDF tend to be more meaningful over the entire corpus. In
other words, the IDF of a rare term is high, and the IDF of a frequent term
is low. For example, if a corpus contains 1,000 documents, all 1,000 of them
might contain the word the, and only 10 of them might contain the word bPhone. With
Equation 9.5, the IDF of the would be 0, and the IDF of bPhone would be log 100,
which is greater than the IDF of the. If a corpus consists mostly of phone reviews,
the word phone would probably have high TF and DF but low IDF.
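The arithmetic in this example can be checked with a short sketch. It assumes that Equation 9.5 has the common form IDF(t) = log(N / DF(t)), where N is the total number of documents; the equation itself is not reproduced here, so that form is an assumption. The text also does not state the base of the logarithm, so base 10 is used only as an illustrative default. The function name idf and its parameters are hypothetical.

```python
import math


def idf(total_docs: int, doc_freq: int, base: float = 10.0) -> float:
    """Inverse document frequency, assuming IDF(t) = log(N / DF(t)).

    The formula and the base-10 default are assumptions, not taken
    from Equation 9.5 itself.
    """
    if doc_freq <= 0 or doc_freq > total_docs:
        raise ValueError("doc_freq must be between 1 and total_docs")
    return math.log(total_docs / doc_freq, base)


# Document frequencies from the example: 1,000 documents in total,
# "the" appears in all of them and "bPhone" in only 10.
N = 1000
doc_freqs = {"the": 1000, "bPhone": 10}

idf_scores = {term: idf(N, df) for term, df in doc_freqs.items()}

for term, score in idf_scores.items():
    print(f"{term}: DF={doc_freqs[term]}, IDF={score:.3f}")
```

Under these assumptions, the word the gets an IDF of log 1 = 0 and bPhone gets log 100, which is 2 in base 10. A different base changes the scale of the scores but not their ordering.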
Although IDF favors words that are more meaningful, it comes
with a caveat. Because the total number of documents in a corpus is
constant, IDF depends solely on DF, so all words with the same DF
receive the same IDF. IDF scores words higher when they occur in
fewer documents, and the words with the lowest DF all share
the same, highest IDF. In Figure 9.3 (c), for example, sunbonnet and narcotic
appeared in an equal number of documents in the Brown corpus; therefore, they
received the same IDF values. In many cases, it is useful to distinguish between two