Cleaning and Validating Data - Clojure Data Analysis

How it works…

Sampling by percentage compares the percentage against a random value for each item in the collection. If the random number is less than the percentage, the item is kept. Because the test is random, the number of items returned won't necessarily match the requested percentage exactly (1 percent, in this case).
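
As a minimal sketch of the percentage approach described above (the name sample-percent and the convention of passing the percentage as a fraction, such as 0.01 for 1 percent, are assumptions made for illustration):

```clojure
;; Keep each item independently with probability pct (a fraction in [0, 1]).
;; The function name and the fraction convention are assumptions.
(defn sample-percent [pct coll]
  ;; (rand) returns a double in [0, 1), so an item survives
  ;; only when the random draw falls below pct.
  (filter (fn [_] (< (rand) pct)) coll))

;; Usage: roughly 1 percent of the items, but the exact count varies per run.
(count (sample-percent 0.01 (range 1000)))
```

Because every item is tested independently, the result for 1,000 items will usually be close to 10 items but is rarely exactly 10.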

Sampling by a set amount is more complicated. We keep a map of the sample, keyed by each item's position in the original sequence. Initially, we populate this map with the first k items of the sequence, where k is the sample size. Afterwards, we walk through the rest of the collection. For each item, we randomly decide whether to keep it or not. If we do keep it, we swap it with a randomly chosen item that is currently in the sample.
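
The random decision is not uniform: the filter used in sample-amount keeps an item when a random number is at most the sample size divided by the item's position. A small sketch of that probability (keep-probability is a hypothetical helper, not part of the recipe):

```clojure
;; Probability that the item at 1-based position i is kept
;; when the sample size is k. Hypothetical helper for illustration only.
(defn keep-probability [k i]
  ;; Clojure returns an exact ratio for integer division;
  ;; the result is capped at 1 for positions inside the initial sample.
  (min 1 (/ k i)))

(keep-probability 10 11)    ; => 10/11
(keep-probability 10 1000)  ; => 1/100
```

Items near the front of the collection are almost always swapped in, while items far along are kept only rarely.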

Let's see what this looks like in the code:

1. Initially, we want to take the sample off the front of the collection. The processing pipeline in sample-amount will work over the rest of the collection, so we'll begin by dropping the initial sample off the front:

   (defn sample-amount [k coll]
     (->> coll
          (drop k)

2. In order to figure out each subsequent item's probability of being chosen for the sample, we need to have its position in the collection. We can get this by associating each item with its position in a vector pair:

          (map vector (range-from (inc k)))

3. Now, filter the items, keeping only those for which a random number is less than or equal to the sample size divided by the item's position number. Items further along in the collection are therefore less likely to be kept. Each item that passes will replace an item in the sample, as outlined in the algorithm:

          (filter #(<= (rand) (/ k (first %))))

4. At this point, we start building the final sample as a hash map that maps each item's position in the original collection to the item itself. We use rand-replace to swap out an item from the sample hash map for each item that passed the random filter in step 3:

          (reduce rand-replace
                  (into {}
                        (map vector (range k) (take k coll))))

5. Once the reduce call is made, we can sort the hash map by position:

          (sort-by first)
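
Putting the steps above together gives something like the following sketch. The excerpt does not show the definitions of range-from or rand-replace, so the versions here are assumptions based on how the pipeline uses them, and the final (map second) step is an assumed way of closing the pipeline by returning only the items:

```clojure
;; Assumed helper: an infinite sequence x, x+1, x+2, ...
(defn range-from [x]
  (iterate inc x))

;; Assumed helper: evict one randomly chosen entry from the sample map
;; and add the incoming [position item] pair in its place.
(defn rand-replace [m [pos item]]
  (-> m
      (dissoc (rand-nth (keys m)))
      (assoc pos item)))

(defn sample-amount [k coll]
  (->> coll
       ;; Step 1: skip the first k items, which seed the sample.
       (drop k)
       ;; Step 2: pair each remaining item with its 1-based position.
       (map vector (range-from (inc k)))
       ;; Step 3: keep an item with probability k / position.
       (filter #(<= (rand) (/ k (first %))))
       ;; Step 4: each surviving item replaces a random sample entry.
       (reduce rand-replace
               (into {} (map vector (range k) (take k coll))))
       ;; Step 5: restore the original ordering.
       (sort-by first)
       ;; Assumed final step: drop the positions and return just the items.
       (map second)))

;; Usage: always five items when the collection has at least five.
(sample-amount 5 (range 100))
```

Unlike sampling by percentage, this always returns exactly k items when the collection holds at least k, and it never needs to know the collection's length in advance.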
