Table 9.5 A Sample Term Frequency Vector

Term        Frequency
i           1
love        2
acme        0
my          1
bebook      0
bphone      1
fantastic   0
slow        0
terrible    0
terrific    0
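
As an illustration, the vector in Table 9.5 can be reproduced by counting how often each vocabulary term occurs in a document. The sketch below is only one way to do this: the function name, the whitespace tokenization, the lowercasing, and the sample sentence are assumptions made for the example, with the sentence chosen so that its counts match the table.

```python
from collections import Counter

# Vocabulary terms in the order they appear in Table 9.5.
VOCABULARY = ["i", "love", "acme", "my", "bebook", "bphone",
              "fantastic", "slow", "terrible", "terrific"]


def term_frequency_vector(text, vocabulary=VOCABULARY):
    """Return the raw count of each vocabulary term in `text`.

    Tokenization is a plain lowercase whitespace split, which is an
    assumption for this sketch, not a prescribed preprocessing step.
    """
    counts = Counter(text.lower().split())
    # A Counter returns 0 for terms that never occur in the text.
    return [counts[term] for term in vocabulary]


# Hypothetical document whose term counts match Table 9.5.
vector = term_frequency_vector("I love LOVE my bPhone")
for term, frequency in zip(VOCABULARY, vector):
    print(term, frequency)
```

Terms that are in the vocabulary but absent from the document, such as acme or slow, receive a frequency of 0, so every document maps to a vector of the same length.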
The term frequency function can be logarithmically scaled. Recall from Figure
3.11 and Figure 3.12 of Chapter 3, “Review of Basic Data Analytic Methods Using
R,” that applying the logarithm to a distribution with a long tail reveals more
detail in the data. Similarly, the logarithm can be applied to word frequencies,
whose distribution also has a long tail, as shown in Equation 9.2.
\mathrm{TF}_2(t, d) = \log\left[ \mathrm{TF}_1(t, d) + 1 \right]    (9.2)
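
A minimal sketch of logarithmic scaling follows, applied to the raw counts from Table 9.5. It assumes the scaled value is log(count + 1) with the natural logarithm; the base of the logarithm and the function name are assumptions made for the example.

```python
import math


def log_scaled_tf(raw_count):
    """Scale a raw term count as log(count + 1).

    Adding 1 keeps a count of 0 at 0 and avoids log(0). The natural
    logarithm is an assumption; another base only rescales the values.
    """
    return math.log(raw_count + 1)


# Raw counts from Table 9.5, in table order.
raw_counts = [1, 2, 0, 1, 0, 1, 0, 0, 0, 0]
scaled = [log_scaled_tf(count) for count in raw_counts]

for count, value in zip(raw_counts, scaled):
    print(count, round(value, 4))
```

The scaling compresses large counts: a term that occurs twice gets a scaled value that is less than double the value of a term that occurs once.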
Because longer documents contain more terms, they tend to have higher term
frequency values. They also tend to contain more distinct terms. These factors
can conspire to raise the term frequency values of longer documents and lead to
undesirable bias favoring longer documents. To address this problem, the term
frequency can be normalized. For example, the term frequency of term t in
document d can be normalized based on the number of terms in d as shown in
Equation 9.3.
\mathrm{TF}_3(t, d) = \frac{\mathrm{TF}_1(t, d)}{n}, \quad |d| = n    (9.3)
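
The effect of length normalization can be sketched as follows: the count of a term is divided by the total number of terms in the document. The function name and both sample documents are hypothetical, and the longer document is simply the shorter one repeated five times.

```python
def normalized_tf(tokens, term):
    """Return the count of `term` divided by the number of tokens.

    An empty document yields 0.0 rather than a division by zero.
    """
    if not tokens:
        return 0.0
    return tokens.count(term) / len(tokens)


# Hypothetical documents: the long one repeats the short one five times.
short_doc = "i love my bphone".split()
long_doc = ("i love my bphone " * 5).split()

# Raw counts differ with document length.
print(short_doc.count("love"), long_doc.count("love"))

# Normalized values are the same for both documents.
print(normalized_tf(short_doc, "love"), normalized_tf(long_doc, "love"))
```

The raw count of love grows with the length of the document, while the normalized value does not, which is the bias that normalization is meant to remove.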
Besides the three common definitions mentioned earlier, there are other less
common variations [22] of term frequency. In practice, one needs to choose the