How will you go about fixing this? It might be easier to have separate versions of this
function for integers and floats. In the end, you need to know your data in order to decide
how best to handle it.
Calculating relative values
One way to normalize values is to scale frequencies by the sizes of their groups. For example,
say the word truth appears three times in a document. This means one thing if the document
has 30 words and something else entirely if it has 300 or 3,000 words. Moreover, if the
dataset has documents of all these lengths, how do you compare the frequencies of words
across documents?
One way to do this is to rescale the frequency counts. In some cases, we can simply scale the
term frequencies by the lengths of their documents. Or, if we want better results, we might use
something more sophisticated such as term frequency-inverse document frequency (TF-IDF).
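TF-IDF itself is beyond the scope of this recipe, but as a rough illustration, it multiplies a term's frequency within one document by a factor that down-weights terms appearing in many documents. The following is a minimal sketch (not the recipe's code); `corpus` is assumed to be a collection of token sequences, and the `inc` in the denominator is a common smoothing choice to avoid division by zero:

```clojure
;; tf: how often term occurs in one document, relative to its length.
(defn tf [term doc]
  (/ (count (filter #{term} doc)) (count doc)))

;; idf: log of (document count / documents containing term),
;; with inc as a simple smoothing term (an assumption, not standard-mandated).
(defn idf [term corpus]
  (Math/log (/ (count corpus)
               (inc (count (filter #(some #{term} %) corpus))))))

(defn tf-idf [term doc corpus]
  (* (tf term doc) (idf term corpus)))
```

A rare term in a short document scores high; a term that appears in every document is pushed toward (or below) zero by the logarithm.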
For this recipe, we'll rescale some term frequencies by the total word count for their document.
Getting ready
We don't need much for this recipe. We'll use the minimal project.clj file, which is
listed here:
(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]])
However, it will be easier if we have a 'pretty-printer' available in the REPL:
(require '[clojure.pprint :as pp])
How to do it…
Actually, let's frame this problem in a more abstract manner. If each datum is a map, we can
rescale one key ( :frequency ) by the total of this key's values in the group defined by another
key ( :document ). This is a more general approach and should be useful in more situations.
1. Let's define a function that rescales by a key's total in a collection. It assigns the
scaled value to a new key ( dest ):
(defn rescale-by-total [src dest coll]
  (let [total (reduce + (map src coll))
        update #(assoc % dest (/ (% src) total))]
    (map update coll)))
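To see what this does, we can try it on a small, made-up collection in the REPL (the sample data here is hypothetical, and :rel-freq is just an illustrative destination key):

```clojure
;; Hypothetical word counts; the total across the collection is 30.
(def word-counts
  [{:word "truth" :frequency 3}
   {:word "the"   :frequency 17}
   {:word "is"    :frequency 10}])

(pp/pprint (rescale-by-total :frequency :rel-freq word-counts))
```

Since the total is 30, truth's :rel-freq comes out as the ratio 3/30, which Clojure reduces to 1/10. Note that the input maps are left untouched; each output map simply gains the new dest key.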
 