"cow" 1/19,
"anyhow" 1/19,
"hope" 1/19,
"never" 2/19,
"saw" 1/19,
"'d" 1/19,
"." 4/19,
"one" 2/19,
"," 1/19,
"rather" 1/19}
Now, we can easily compare these values to the frequencies generated from other documents.
How it works…
This works by changing all of the raw frequencies into ratios based on each document's size.
These numbers are comparable. In our example from the introduction to this recipe, 0.046
(23/500) is slightly more than 0.040 (40/1000). However, both of these numbers are
ridiculously high; words that occur this often in English are words such as the.
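The definition of scale-by-total isn't reproduced in this excerpt. As a sketch of the idea only, and assuming the input is a plain map from token to raw count (not necessarily the recipe's exact definition), it might look something like this:

```clojure
;; A sketch of scale-by-total: divide each raw count by the total
;; number of tokens, so documents of different sizes become comparable.
(defn scale-by-total
  [freqs]
  (let [total (reduce + (vals freqs))]
    (into {}
          (map (fn [[token n]] [token (/ n total)]))
          freqs)))

(scale-by-total {"never" 2, "one" 2, "." 4, "cow" 1})
;; => {"never" 2/9, "one" 2/9, "." 4/9, "cow" 1/9}
```

Because Clojure keeps exact ratios, calling float on a result is only needed when you want a decimal for display or comparison.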
Document-scaled frequencies do have problems with shorter texts. For example, take this
tweet by the Twitter user @LegoAcademics :
"Dr Brown's random number algorithm is based on the bafling loor sequences
chosen by the Uni library elevator".
In this tweet, let's see what the scaled frequency of random is:
(-> (str "Dr Brown's random number algorithm is based "
"on the baffling floor seqeuences chosen by "
"the Uni library elevator.")
tokenize
normalize
frequencies
scale-by-total
(get "random")
float)
This gives us 0.05. Again, this is ridiculously high. Most other tweets won't include the term
random at all. Because of this, you can still only compare tweets with other tweets.
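To see what tweet-to-tweet comparison looks like in practice, here is a self-contained sketch. The tokenize and normalize functions below are simplified stand-ins for the recipe's versions, and the second tweet is invented for illustration:

```clojure
(require '[clojure.string :as str])

;; Simplified stand-ins for the recipe's tokenizer and normalizer.
(defn tokenize [text]
  (re-seq #"[a-zA-Z']+|[.,!?]" text))

(defn normalize [tokens]
  (map str/lower-case tokens))

;; Same idea as before: scale raw counts by the total token count.
(defn scale-by-total [freqs]
  (let [total (reduce + (vals freqs))]
    (into {} (map (fn [[t n]] [t (/ n total)])) freqs)))

(defn scaled-freq
  "The document-scaled frequency of token in text (0 if absent)."
  [text token]
  (-> text tokenize normalize frequencies scale-by-total
      (get token 0) float))

;; Both texts are tweets of similar length, so these two numbers
;; can be meaningfully compared with each other.
(scaled-freq "Dr Brown's random number algorithm is baffling." "random")
;; => 0.125
(scaled-freq "Random numbers are random, truly random." "random")
;; => 0.375
```

Comparing either of these numbers against a frequency scaled from a 1,000-word document would be misleading, which is the point of the caveat above.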