Sampling Procedures
So far, this chapter has introduced the concept of the random variable
and shown how one can generate “draws” from these distributions. When
there is a large population, draws can also be generated by sampling from
the population. Over the decades a number of sampling procedures and a
massive body of literature have been developed. Many of these procedures
focus on surveys and polls, where taking
a census of the entire population is too costly, too time consuming, or both.
Because the focus here is primarily on streaming data, most of these
procedures are beyond the scope of this discussion.
This section covers the basics of sampling from a fixed population insofar as
it allows for understanding how to sample from a streaming dataset. From
there, some modifications of the basic streaming procedure are introduced
to cover some of the more interesting streaming analysis scenarios.
Sampling from a Fixed Population
The simplest form of sampling from a population is, unsurprisingly, known
as simple random sampling. The goal of this procedure is to sample n
elements from a population of N total elements such that any given element
has an equal chance of being sampled. If all the elements can be held in
RAM and a given element is allowed to be in the sample more than once, the
sampling implementation is trivial:
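import java.util.Arrays;
// MersenneTwister is assumed to come from a library such as Colt
// (cern.jet.random.engine.MersenneTwister); import it accordingly.

public class SimpleRandomSample {
    static MersenneTwister rng = new MersenneTwister();

    // Draws n elements uniformly at random from in, with replacement.
    public static <E> E[] withReplacement(E[] in, int n) {
        // copyOf yields an array of the same runtime type as in, avoiding an unsafe cast
        E[] sample = Arrays.copyOf(in, n);
        for (int i = 0; i < n; i++)
            // scale a uniform draw from the unit interval to an index in [0, in.length)
            sample[i] = in[(int) Math.floor(in.length * rng.nextDouble())];
        return sample;
    }
}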
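The equal-chance property of sampling with replacement can be checked empirically. The following is a minimal sketch, not part of the original listing: the class name `ReplacementDemo` and the method `drawCounts` are hypothetical, and `java.util.Random` is used as a stand-in for the Mersenne Twister generator so that the example has no external dependencies.

```java
import java.util.Random;

// Hypothetical demo class; java.util.Random stands in for MersenneTwister.
final class ReplacementDemo {
    // Counts how often each index of a population of the given size is drawn
    // when sampling with replacement the given number of times.
    static int[] drawCounts(int populationSize, int draws, Random rng) {
        int[] counts = new int[populationSize];
        for (int i = 0; i < draws; i++) {
            // nextInt(bound) returns a uniform index in [0, bound)
            counts[rng.nextInt(populationSize)]++;
        }
        return counts;
    }
}
```

With a population of 5 elements and 100,000 draws, each count should land close to 20,000, and the same index is free to be drawn many times.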
If the entire dataset fits in RAM, but an element should be sampled without
replacement, a somewhat different algorithm is used. The base for the
algorithm is the Fisher-Yates Shuffle, which was developed as a method for
randomly permuting a finite sequence.
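One common way to build sampling without replacement on a Fisher-Yates shuffle is to stop the shuffle after n swaps and keep the first n slots. The sketch below assumes that approach; it is not necessarily the listing the text goes on to give. The class name `ShuffleSample`, the method signature, and the use of `java.util.Random` in place of the Mersenne Twister generator are all illustrative assumptions.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical helper class; the name and signature are illustrative only.
final class ShuffleSample {
    // Draws n elements from distinct positions of in via a partial Fisher-Yates shuffle.
    static <E> E[] withoutReplacement(E[] in, int n, Random rng) {
        if (n < 0 || n > in.length) {
            throw new IllegalArgumentException("n must be between 0 and in.length");
        }
        // Work on a copy so the caller's array is not reordered.
        E[] pool = Arrays.copyOf(in, in.length);
        for (int i = 0; i < n; i++) {
            // Pick a position uniformly from the not-yet-chosen tail [i, length).
            int j = i + rng.nextInt(pool.length - i);
            E tmp = pool[i];
            pool[i] = pool[j];
            pool[j] = tmp;
        }
        // The first n slots now hold the sample.
        return Arrays.copyOf(pool, n);
    }
}
```

Because each swap moves a chosen element out of the remaining pool, no position can be selected twice, and stopping after n swaps avoids shuffling the whole array when n is much smaller than N.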