Cleaning and Validating Data - Clojure Data Analysis

Sampling by percentage

1. Performing a rough sampling by percentage is pretty simple. Here, k is the fraction of items to keep, expressed as a number between 0 and 1:

(defn sample-percent
  [k coll] (filter (fn [_] (<= (rand) k)) coll))

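2. Using it is also simple. Because each item is kept or dropped independently, the size of the sample only approximates the requested percentage: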

user=> (sample-percent 0.01 (range 1000))
(141 146 155 292 598 624 629 640 759 815 852 889)
user=> (count *1)
12
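Since sample-percent draws from the global random source, two runs over the same data will usually return different samples. When a repeatable sample is useful, for example in a test, one option is to drive the same coin flip from a seeded generator. The following is only a sketch: sample-percent-seeded and its seed parameter are hypothetical names that do not appear in the recipe.

```clojure
(import 'java.util.Random)

;; Hypothetical variant of sample-percent: the same per-item coin flip,
;; but driven by a caller-supplied seed so that a run can be repeated.
(defn sample-percent-seeded
  [seed k coll]
  (let [rng (Random. (long seed))]
    ;; filterv is eager, so every random draw happens here, in input order.
    (filterv (fn [_] (<= (.nextDouble rng) k)) coll)))

;; The same seed, fraction, and input always give the same sample.
(= (sample-percent-seeded 42 0.01 (range 1000))
   (sample-percent-seeded 42 0.01 (range 1000)))
;; => true
```

The sample size still varies with the seed. Only the repeatability changes.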

Sampling exactly

Sampling for an exact count is a little more complicated. We'll use Donald Knuth's algorithm from The Art of Computer Programming, Volume 2, an approach commonly known as reservoir sampling. It takes the initial sample off the front of the input sequence, and from that point on, each new item from the input has a chance of sample-size / size-of-collection-so-far of randomly replacing one existing item in the sample.

To implement this, we'll need one helper function that takes a map and a new key-value pair. It removes a random key from the map and inserts the new pair:

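;; The new pair arrives as a single [k v] vector, because reduce calls
;; its function with exactly two arguments: the accumulator and one item.
(defn rand-replace
  [m [k v]] (assoc (dissoc m (rand-nth (keys m))) k v))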
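As a rough check on why this replacement scheme is fair, here is a sketch of the argument. It assumes the evicted key is chosen uniformly at random, as rand-nth does, and it writes n for the total number of input items and i for an item's 1-based position. Both symbols are introduced here and are not used in the recipe itself.

```latex
% k = sample size, n = total number of items, i = 1-based position of an item
\begin{align*}
P(\text{item } i \text{ enters the sample}) &= \frac{k}{i} \qquad (i > k) \\
P(\text{a given entry survives step } j) &= 1 - \frac{k}{j}\cdot\frac{1}{k} = \frac{j-1}{j} \\
P(\text{item } i \text{ is in the final sample}) &= \frac{k}{i}\prod_{j=i+1}^{n}\frac{j-1}{j}
  = \frac{k}{i}\cdot\frac{i}{n} = \frac{k}{n}
\end{align*}
```

The first k items start in the sample with probability 1, and the same product taken from j = k + 1 to n again gives k / n, so every item ends up equally likely to be chosen.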

We'll also need another small utility to create an infinite range that begins at a given place:

(defn range-from [x] (map (partial + x) (range)))
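To see what this utility produces, here is a short sketch that repeats the definition from the text and looks at the first few values. The comparison with iterate is our own observation rather than part of the recipe.

```clojure
;; range-from as defined in the text
(defn range-from [x] (map (partial + x) (range)))

;; The sequence is infinite, so always bound it with take or similar.
(take 5 (range-from 10))
;; => (10 11 12 13 14)

;; Pairing positions with items, which is how the sampling code uses it.
(map vector (range-from 1) [:a :b :c])
;; => ([1 :a] [2 :b] [3 :c])

;; (iterate inc x) from clojure.core yields the same values.
(= (take 5 (range-from 10)) (take 5 (iterate inc 10)))
;; => true
```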

Now, we use this to create the function that does the sampling:

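(defn sample-amount [k coll]
  (->> coll
       (drop k)                               ; items after the initial sample
       (map vector (range-from (inc k)))      ; pair each with its 1-based position
       (filter #(<= (rand) (/ k (first %))))  ; keep with probability k / position
       (reduce rand-replace                   ; each kept item evicts a random entry
               ;; initial sample: the first k items, keyed by 0-based position
               (into {} (map vector (range k) (take k coll))))
       (sort-by first)                        ; restore input order
       (map second)))                         ; drop the position keys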

Using this is as simple as using the first function, though:

user=> (sample-amount 10 (range 1000))
(70 246 309 430 460 464 471 547 955 976)
user=> (count *1)
10
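For comparison, the same idea can be sketched with a vector as the reservoir instead of a map. This is not the recipe's implementation: reservoir-sample is a hypothetical name, and unlike sample-amount it returns the items in reservoir order rather than in input order.

```clojure
;; Hypothetical alternative to sample-amount: a vector-based reservoir.
;; Returns k items in reservoir order, not in input order.
(defn reservoir-sample
  [k coll]
  (let [reservoir (vec (take k coll))]
    (if (< (count reservoir) k)
      reservoir                           ; fewer than k items: return them all
      (->> (drop k coll)
           (map vector (iterate inc k))   ; 0-based index of each later item
           (reduce (fn [res [i x]]
                     ;; j is uniform over 0..i, so it falls inside the
                     ;; reservoir with probability k / (i + 1).
                     (let [j (rand-int (inc i))]
                       (if (< j k) (assoc res j x) res)))
                   reservoir)))))

;; Always exactly 10 items when the input has at least 10.
(count (reservoir-sample 10 (range 1000)))
;; => 10
```

Drawing a single random index per item both decides whether the item is kept and picks the slot it overwrites, which avoids the separate filter and random-key steps.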
