Sampling by percentage
1. Performing a rough sampling by percentage is pretty simple. Here, k is the fraction of items to keep (for example, 0.01 for 1 percent):
(defn sample-percent
  [k coll]
  (filter (fn [_] (<= (rand) k)) coll))
2. Using it is also simple:
user=> (sample-percent 0.01 (range 1000))
(141 146 155 292 598 624 629 640 759 815 852 889)
user=> (count *1)
12
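To see why this sampling is only rough, the following sketch draws several samples and counts each one. The names sample-sizes and first-three are made up for this illustration, and sample-percent is repeated so that the block runs on its own.

```clojure
;; sample-percent is repeated from above so this block runs on its own.
(defn sample-percent
  [k coll]
  (filter (fn [_] (<= (rand) k)) coll))

;; Draw five independent 1% samples from 1000 items and count each one.
;; The counts vary from run to run; only their average tends toward
;; (* 0.01 1000) = 10.
(def sample-sizes
  (vec (repeatedly 5 #(count (sample-percent 0.01 (range 1000))))))

;; filter is lazy, so this also works on an infinite input, as long as
;; only a finite prefix of the result is consumed.
(def first-three
  (doall (take 3 (sample-percent 0.01 (range)))))
```

Because every item is kept or dropped independently, the size of the result is itself random. That is fine for a quick look at the data, but not when a fixed number of items is required.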
Sampling exactly
Sampling for an exact count is a little more complicated. We'll use an algorithm from
Donald Knuth's The Art of Computer Programming, Volume 2. It takes the first sample-size
items off the front of the input sequence as the initial sample. From this point on, each new
item from the input has a chance of sample-size / size-of-collection-so-far of randomly
replacing one existing item in the sample. To implement this, we'll need a helper function
that takes a map and a new key-value pair. It removes a random key from the map and
inserts the new pair:
;; The new pair arrives as one [k v] vector, the shape reduce passes in.
(defn rand-replace
  [m [k v]]
  (assoc (dissoc m (rand-nth (keys m))) k v))
We'll also need another small utility to create an infinite range that begins at a given place:
(defn range-from [x] (map (partial + x) (range)))
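As a quick check of the two helpers, here is a small sketch of how each behaves on its own. The name replaced and the sample map are made up for this illustration. rand-replace is written here to take the new pair as a single [k v] vector, which is an assumption of this sketch.

```clojure
;; Helpers restated so the block runs on its own. rand-replace takes the
;; new pair as one [k v] vector (an assumption of this sketch).
(defn rand-replace
  [m [k v]]
  (assoc (dissoc m (rand-nth (keys m))) k v))

(defn range-from [x] (map (partial + x) (range)))

;; range-from is lazy and unbounded, so always take a finite prefix.
(take 5 (range-from 11))
;; => (11 12 13 14 15)

;; Replacing into a three-entry map evicts one random old key and adds
;; the new one, so the size stays at 3 and key 10 is always present.
(def replaced (rand-replace {0 :a, 1 :b, 2 :c} [10 :z]))
(count replaced)         ;; => 3
(contains? replaced 10)  ;; => true
```

Which of the three original keys disappears changes from call to call. The only guarantees are the size of the map and the presence of the new pair.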
Now, we use these helpers to create the function that does the sampling:
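(defn sample-amount [k coll]
  (->> coll
       (drop k)
       ;; Pair each remaining item with its 1-based position in the input.
       (map vector (range-from (inc k)))
       ;; Keep the item at position n with probability k/n.
       (filter #(<= (rand) (/ k (first %))))
       ;; Each surviving item evicts one random entry from the sample so far.
       (reduce rand-replace
               (into {} (map vector (range k) (take k coll))))
       ;; Restore the input order and drop the position keys.
       (sort-by first)
       (map second)))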
Using this is as simple as using the first function, sample-percent, though:
user=> (sample-amount 10 (range 1000))
(70 246 309 430 460 464 471 547 955 976)
user=> (count *1)
10
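One way to gain some confidence in the sampler is an empirical check, sketched below. It samples 2 items out of 4 many times and tallies how often each item is chosen. The names trials and tallies are made up for this illustration, and the definitions are restated, with rand-replace taking the new pair as one [k v] vector, so that the block runs on its own.

```clojure
;; Definitions restated so the block runs on its own. rand-replace takes
;; the new pair as one [k v] vector (an assumption of this sketch).
(defn rand-replace
  [m [k v]]
  (assoc (dissoc m (rand-nth (keys m))) k v))

(defn range-from [x] (map (partial + x) (range)))

(defn sample-amount [k coll]
  (->> coll
       (drop k)
       (map vector (range-from (inc k)))
       (filter #(<= (rand) (/ k (first %))))
       (reduce rand-replace
               (into {} (map vector (range k) (take k coll))))
       (sort-by first)
       (map second)))

;; Number of repeated samples to draw (an arbitrary choice).
(def trials 10000)

;; Sample 2 of the 4 items on every trial and count how often each item
;; shows up across all trials.
(def tallies
  (frequencies
    (mapcat (fn [_] (sample-amount 2 (range 4))) (range trials))))
```

If every item is equally likely to be picked, each of the four tallies should land near (* trials 2/4), or about 5000, and the tallies always add up to exactly 20000 because every sample contains exactly 2 items.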
 