Statistical Approximation of Streaming Data - Real-Time Analytics

Sampling Procedures

So far, this chapter has introduced the concept of the random variable
and shown how one can generate “draws” from its distribution. When
there is a large population, draws can also be generated by sampling from
the population. Over the decades, a number of sampling procedures and a
massive body of literature have been developed. Many of these procedures
focus on surveys and polls in situations where taking a census of the
entire population is too costly, too time consuming, or both. Because the
focus here is primarily on streaming data, most of these procedures are
out of scope.

This section covers the basics of sampling from a fixed population insofar as
it allows for understanding how to sample from a streaming dataset. From
there, some modifications of the basic streaming procedure are introduced
to cover some of the more interesting streaming analysis scenarios.

Sampling from a Fixed Population

The simplest form of sampling from a population is, unsurprisingly, known
as simple random sampling. The goal of this procedure is to sample n
elements from a population of N total elements such that any given element
has an equal chance of being sampled. If all the elements can be held in
RAM and a given element is allowed to be in the sample more than once, the
sampling implementation is trivial:

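    import java.util.Arrays;

    public class SimpleRandomSample {
        // MersenneTwister is assumed to come from a library (for example Colt or
        // Commons Math) whose nextDouble() returns a uniform value below 1.
        static MersenneTwister rng = new MersenneTwister();

        public static <E> E[] withReplacement(E[] in, int n) {
            // copyOf keeps the runtime element type of in; an Object[] cast to
            // E[] would fail with a ClassCastException at the call site.
            E[] sample = Arrays.copyOf(in, n);
            for (int i = 0; i < n; i++)
                // Scale by the population size in.length, not the sample size n.
                sample[i] = in[(int) Math.floor(in.length * rng.nextDouble())];
            return sample;
        }
    }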
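To try sampling with replacement without an external random number library, the following is a minimal, self-contained sketch. It makes several assumptions that are not in the original text: the class name `ReplacementSampleSketch` is hypothetical, `java.util.Random` stands in for the Mersenne Twister generator, and the generator is passed in as a parameter so that runs can be seeded.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical class name; java.util.Random is a stand-in for MersenneTwister.
class ReplacementSampleSketch {

    // Draws n elements from in, choosing each index uniformly and independently,
    // so the same element may appear in the sample more than once.
    static <E> E[] withReplacement(E[] in, int n, Random rng) {
        if (in.length == 0 && n > 0)
            throw new IllegalArgumentException("cannot sample from an empty population");
        // copyOf yields an array of length n with the same runtime type as in;
        // every slot is overwritten in the loop below.
        E[] sample = Arrays.copyOf(in, n);
        for (int i = 0; i < n; i++)
            sample[i] = in[rng.nextInt(in.length)]; // uniform index in [0, in.length)
        return sample;
    }

    public static void main(String[] args) {
        String[] population = {"a", "b", "c", "d"};
        // n may exceed the population size because repeats are allowed.
        String[] sample = withReplacement(population, 6, new Random(42L));
        System.out.println(Arrays.toString(sample));

        // Rough check of the equal-chance property: each of the four values
        // should account for about a quarter of 100,000 draws.
        Integer[] indices = {0, 1, 2, 3};
        int[] counts = new int[indices.length];
        for (Integer idx : withReplacement(indices, 100_000, new Random(7L)))
            counts[idx]++;
        System.out.println(Arrays.toString(counts));
    }
}
```

Using `rng.nextInt(in.length)` avoids the floating-point scaling step entirely, which removes the easy mistake of multiplying by the sample size n instead of the population size N.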

If the entire dataset fits in RAM, but an element should be sampled without
replacement, a somewhat different algorithm is used. The basis for the
algorithm is the Fisher-Yates shuffle, which was developed as a
