"cow" 1/19,
"anyhow" 1/19,
"hope" 1/19,
"never" 2/19,
"saw" 1/19,
"'d" 1/19,
"." 4/19,
"one" 2/19,
"," 1/19,
"rather" 1/19}
Now, we can easily compare these values to the frequencies generated from other documents.
How it works…
This works by changing all of the raw frequencies into ratios based on each document's size.
These numbers are comparable. In our example from the introduction to this recipe, 0.046
(23/500) is slightly more than 0.040 (40/1000). However, both of these numbers are
ridiculously high; words that occur this often in English are words such as the.
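The definition of scale-by-total isn't reproduced in this excerpt. As a sketch of the idea only, and assuming the input is a plain map from token to raw count (not necessarily the recipe's exact definition), it might look something like this:

```clojure
;; A sketch of scale-by-total: divide each raw count by the total
;; number of tokens, so documents of different sizes become comparable.
(defn scale-by-total
  [freqs]
  (let [total (reduce + (vals freqs))]
    (into {}
          (map (fn [[token n]] [token (/ n total)]))
          freqs)))

(scale-by-total {"never" 2, "one" 2, "." 4, "cow" 1})
;; => {"never" 2/9, "one" 2/9, "." 4/9, "cow" 1/9}
```

Because Clojure keeps exact ratios, calling float on a result is only needed when you want a decimal for display or comparison.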
Document-scaled frequencies do have problems with shorter texts. For example, take this
tweet by the Twitter user @LegoAcademics :
"Dr Brown's random number algorithm is based on the bafling loor sequences
chosen by the Uni library elevator".
In this tweet, let's see what the scaled frequency of random is:
(-> (str "Dr Brown's random number algorithm is based "
"on the baffling floor seqeuences chosen by "
"the Uni library elevator.")
tokenize
normalize
frequencies
scale-by-total
(get "random")
float)
This gives us 0.05. Again, this is ridiculously high. Most other tweets won't include the term
random at all. Because of this, you can still only compare tweets with other tweets.
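To see what tweet-to-tweet comparison looks like in practice, here is a self-contained sketch. The tokenize and normalize functions below are simplified stand-ins for the recipe's versions, and the second tweet is invented for illustration:

```clojure
(require '[clojure.string :as str])

;; Simplified stand-ins for the recipe's tokenizer and normalizer.
(defn tokenize [text]
  (re-seq #"[a-zA-Z']+|[.,!?]" text))

(defn normalize [tokens]
  (map str/lower-case tokens))

;; Same idea as before: scale raw counts by the total token count.
(defn scale-by-total [freqs]
  (let [total (reduce + (vals freqs))]
    (into {} (map (fn [[t n]] [t (/ n total)])) freqs)))

(defn scaled-freq
  "The document-scaled frequency of token in text (0 if absent)."
  [text token]
  (-> text tokenize normalize frequencies scale-by-total
      (get token 0) float))

;; Both texts are tweets of similar length, so these two numbers
;; can be meaningfully compared with each other.
(scaled-freq "Dr Brown's random number algorithm is baffling." "random")
;; => 0.125
(scaled-freq "Random numbers are random, truly random." "random")
;; => 0.375
```

Comparing either of these numbers against a frequency scaled from a 1,000-word document would be misleading, which is the point of the caveat above.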