Cleaning and Validating Data - Clojure Data Analysis


2. Now, let's use this function to define a function that rescales by group:

(defn rescale-by-group [src group dest coll]
  (->> coll
       (sort-by group)
       (group-by group)
       vals
       (mapcat #(rescale-by-total src dest %))))

3. We can easily make up some data to test this:

(def word-counts
  [{:word 'the, :freq 92, :doc 'a}
   {:word 'a, :freq 76, :doc 'a}
   {:word 'jack, :freq 4, :doc 'a}
   {:word 'the, :freq 3, :doc 'b}
   {:word 'a, :freq 2, :doc 'b}
   {:word 'mary, :freq 1, :doc 'b}])

Now, we can see how it works:

user=> (pp/pprint (rescale-by-group :freq :doc :scaled word-counts))
({:freq 92, :word the, :scaled 23/43, :doc a}
 {:freq 76, :word a, :scaled 19/43, :doc a}
 {:freq 4, :word jack, :scaled 1/43, :doc a}
 {:freq 3, :word the, :scaled 1/2, :doc b}
 {:freq 2, :word a, :scaled 1/3, :doc b}
 {:freq 1, :word mary, :scaled 1/6, :doc b})

We can immediately see that the scaled values are more easily comparable. The scaled frequencies for the word 'the', for example, are roughly in line with each other in a way that the raw frequencies are not (about 0.53 and 0.5 versus 92 and 3). Of course, since this isn't a real dataset, the frequencies are meaningless, but the example still illustrates the method and how it improves the dataset.
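To make the comparison concrete, the two ratios for 'the' can be converted to whole-number percentages. This is only a small sketch; the names the-scaled and percent are made up for this illustration and are not part of the recipe.

```clojure
;; Scaled frequencies of 'the' in documents a and b,
;; copied from the pretty-printed output.
(def the-scaled [23/43 1/2])

;; Hypothetical helper: convert a ratio to a whole-number percentage.
(defn percent [r]
  (Math/round (double (* 100 r))))

(map percent the-scaled)
;; => (53 50)
```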

How it works…

For each function, we pass in a couple of keys: a source key and a destination key. The first function, rescale-by-total, totals the values for the source key, and then sets the destination key of each item to the ratio of that item's source value to the total of the source values across all of the items in the collection.
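The definition of rescale-by-total is not shown in this section. The following is a minimal sketch reconstructed from the description above, so the body is an assumption and may differ from the recipe's actual version.

```clojure
;; Sketch only: reconstructed from the prose, not copied from the recipe.
(defn rescale-by-total [src dest coll]
  (let [total (reduce + (map src coll))]          ; sum of the source key
    (map #(assoc % dest (/ (get % src) total))    ; item value / total
         coll)))

(rescale-by-total :freq :scaled
                  [{:word 'the, :freq 3}
                   {:word 'a, :freq 2}
                   {:word 'mary, :freq 1}])
;; the :scaled values are 1/2, 1/3, and 1/6
```

Because the frequencies are integers, Clojure keeps the results as exact ratios. Note that this sketch divides by the total directly, so a collection whose source values sum to zero would throw an ArithmeticException.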

The second function, rescale-by-group, uses another key: the group key. It sorts and groups the items by group key and then passes each group to rescale-by-total.
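The intermediate value that the grouping step produces can be inspected on its own. The sketch below uses a small made-up collection (sample and grouped are illustrative names) and only core functions, so it runs without rescale-by-total.

```clojure
;; Made-up data, deliberately out of order by :doc.
(def sample
  [{:word 'the, :freq 3, :doc 'b}
   {:word 'the, :freq 92, :doc 'a}
   {:word 'a, :freq 76, :doc 'a}])

;; group-by returns a map from each group key to a vector of its items.
(def grouped (group-by :doc (sort-by :doc sample)))

(count (get grouped 'a))      ;; => 2
(map :freq (get grouped 'a))  ;; => (92 76)
(map :freq (get grouped 'b))  ;; => (3)
```

Calling vals on this map yields one collection per document, which is why each group can be totaled and rescaled independently before mapcat concatenates the results back into a single sequence.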
