How it works…
Sampling by percentage compares the percentage against a random value for each item
in the collection. If the random number is less than the percentage, the item is kept. Because
the selection is random, the number of items it pulls out won't necessarily match the
parameter exactly (1 percent, in this case).
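As a rough illustration of the paragraph above, here is a minimal sketch of percentage sampling. The name sample-percent and its argument order are assumptions made for this example, and the percentage is written as a fraction (0.01 for 1 percent):

```clojure
;; Hypothetical helper (name and signature assumed for illustration):
;; keep each item independently with probability p, where p is a
;; fraction between 0.0 and 1.0.
(defn sample-percent [p coll]
  (filter (fn [_] (< (rand) p)) coll))

;; The sample size varies from run to run. For 10,000 items at
;; 1 percent it is usually close to, but rarely exactly, 100.
(count (sample-percent 0.01 (range 10000)))
```

Because each item is tested on its own, this approach needs only one pass and no knowledge of the collection's length, at the cost of an unpredictable sample size.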
Sampling by a set amount is more complicated. We keep a map of the sample, keyed by each
item's position in the original sequence. Initially, we populate this map with the first k items
of the sequence. Afterwards, we walk through the rest of the collection. For each item, we
randomly decide whether to keep it or not. If we do keep it, it replaces one randomly chosen
item that is currently in the sample.
Let's see what this looks like in the code:
1. Initially, we want to take the sample off the front of the collection. The processing
pipeline in sample-amount will work over the rest of the collection, so we'll begin
by dropping the initial sample off the front:
(defn sample-amount [k coll]
  (->> coll
       (drop k)
2. In order to figure out each subsequent item's probability of being chosen for the
sample, we need to have its position in the collection. We can get this by pairing
each item with its position in a vector:
       (map vector (range-from (inc k)))
3. Now, filter out every item for which a random number is greater than the sample
size divided by the item's position. In other words, each item is kept with probability
k divided by its position, so items later in the collection are less likely to be selected:
       (filter #(<= (rand) (/ k (first %))))
4. At this point, we start building the final sample as a hash map that maps each item's
position in the original collection to the item itself. We use rand-replace to swap
out an item from the sample hash map for each item that passed the random filter in
Step 3:
       (reduce rand-replace
               (into {}
                     (map vector (range k) (take k coll))))
5. Once the reduce call has finished, we can sort the hash map's entries by position:
       (sort-by first)
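The five fragments above can be assembled into one runnable sketch. The text names range-from and rand-replace but does not show their definitions here, so the versions below are assumptions written to fit how the pipeline calls them. The closing parentheses after the sort are also added only to make the sketch complete; the original pipeline may continue past Step 5.

```clojure
;; Assumed helper: an infinite sequence x, x+1, x+2, ...
(defn range-from [x]
  (iterate inc x))

;; Assumed helper: evict one randomly chosen entry from the sample
;; map m and add the incoming [position item] pair in its place.
(defn rand-replace [m [pos item]]
  (assoc (dissoc m (rand-nth (keys m))) pos item))

(defn sample-amount [k coll]
  (->> coll
       (drop k)                                ; Step 1: skip the initial sample
       (map vector (range-from (inc k)))       ; Step 2: pair items with positions
       (filter #(<= (rand) (/ k (first %))))   ; Step 3: keep with probability k/position
       (reduce rand-replace                    ; Step 4: swap survivors into the sample
               (into {}
                     (map vector (range k) (take k coll))))
       (sort-by first)))                       ; Step 5: order entries by position

;; Returns a sequence of 10 [position item] pairs, sorted by position.
(sample-amount 10 (range 1000))
```

In this sketch the result still carries each item's position. Adding (map second) as a last pipeline step would return just the sampled items.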
 