Statistical Approximation of Streaming Data - Real-Time Analytics


the probability density function, the derivative of the CDF. (The discrete

version is called the probability mass function.)

The Normal and Chi-Square Distributions

The normal distribution, also called a Gaussian distribution, is probably

the most famous of all the statistical distributions. One reason is that its

functional form leads to nice results for many different procedures. For

example, clustering algorithms often implicitly assume that the underlying

distribution of the cluster is normal.

The other, more important, reason it is so famous is because the distribution

of the mean of observations of a random variable converges toward a normal

distribution as the number of observations goes to infinity. Amazingly, this

happens regardless of the underlying distribution, assuming that certain

conditions are met (they usually are). In other words, if you have enough

data, then you can approximate nearly anything by this distribution (even

discrete distributions!).
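The convergence described above can be checked with a small simulation. The sketch below is illustrative only: the class name `CltSketch`, the method `fractionWithinOneSigma`, and the choice of Uniform(0,1) draws are assumptions, not part of the original text. It averages n uniform draws many times and reports how often the average lands within one theoretical standard deviation of the true mean; for a normal distribution that fraction is about 0.683.

```java
import java.util.Random;

public class CltSketch {
    // Fraction of sample means (each over n Uniform(0,1) draws) that fall
    // within one theoretical standard deviation of the true mean 0.5.
    static double fractionWithinOneSigma(int n, int trials, long seed) {
        Random rng = new Random(seed);
        double mu = 0.5;                                   // mean of Uniform(0,1)
        double sigmaOfMean = Math.sqrt(1.0 / (12.0 * n));  // sqrt(variance / n), variance = 1/12
        int hits = 0;
        for (int t = 0; t < trials; t++) {
            double sum = 0.0;
            for (int i = 0; i < n; i++) {
                sum += rng.nextDouble();
            }
            if (Math.abs(sum / n - mu) <= sigmaOfMean) {
                hits++;
            }
        }
        return (double) hits / trials;
    }

    public static void main(String[] args) {
        // As n grows, the fraction should approach the normal value of about 0.683.
        for (int n : new int[] {1, 2, 30}) {
            System.out.println("n=" + n + " fraction="
                    + fractionWithinOneSigma(n, 200000, 42L));
        }
    }
}
```

With a single draw per mean (n = 1) the underlying distribution is flat, so the fraction sits near 0.58; as n increases it should drift toward the normal figure of roughly 0.68.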

The normal distribution has two parameters: a mean parameter (mu) and a

standard deviation parameter (sigma). Its density function has a simple

implementation for any real value of x:

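public static double dnorm(double x, double mu, double sig) {
    // Density of the normal distribution with mean mu and standard
    // deviation sig, evaluated at x: exp(-(x-mu)^2 / (2*sig^2)) / sqrt(2*pi*sig^2)
    return Math.exp(
        -Math.pow(x - mu, 2)
        / (2 * sig * sig)
    ) / Math.sqrt(2 * Math.PI * sig * sig);
}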
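One way to sanity-check a density implementation is to integrate it numerically: the total area should be close to 1, and roughly 68.3% of it should lie within one sigma of the mean. The following is a minimal sketch under that idea; the class `NormalDensityCheck` and the trapezoid helper `integrate` are hypothetical names introduced here, and the density is restated so the example is self-contained.

```java
public class NormalDensityCheck {
    // Normal density with mean mu and standard deviation sig, evaluated at x.
    static double dnorm(double x, double mu, double sig) {
        double z = (x - mu) / sig;
        return Math.exp(-0.5 * z * z) / (sig * Math.sqrt(2 * Math.PI));
    }

    // Trapezoid-rule approximation of the integral of dnorm over [a, b].
    static double integrate(double a, double b, double mu, double sig, int steps) {
        double h = (b - a) / steps;
        double total = 0.5 * (dnorm(a, mu, sig) + dnorm(b, mu, sig));
        for (int i = 1; i < steps; i++) {
            total += dnorm(a + i * h, mu, sig);
        }
        return total * h;
    }

    public static void main(String[] args) {
        double mu = 10.0, sig = 2.0; // arbitrary illustrative parameters
        // Area over mu +/- 8 sigma: should be very close to 1.
        System.out.println(integrate(mu - 8 * sig, mu + 8 * sig, mu, sig, 10000));
        // Area over mu +/- 1 sigma: should be close to 0.6827.
        System.out.println(integrate(mu - sig, mu + sig, mu, sig, 10000));
    }
}
```

The integral from negative infinity up to x is the CDF, so the same helper also gives a rough numerical CDF when the lower limit is set far below the mean.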

If X_1, …, X_k are independent standard normal random variables (mean 0,

standard deviation 1), then the sum of their squares takes on what is known

as a chi-square distribution with k degrees of freedom. So, the square of a

single standard normal random variable will have a chi-square distribution

with 1 degree of freedom. The chi-square is used

to model the variance of a normal distribution as well as for analyzing

“contingency tables,” which are used to determine if the rate of occurrence

of an event is different between two groups. The density function for this

distribution is fairly complicated:
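In standard textbook notation, the chi-square density with k degrees of freedom is f(x; k) = x^(k/2 - 1) e^(-x/2) / (2^(k/2) Γ(k/2)) for x > 0, where Γ is the gamma function. The sketch below is one possible way to evaluate it and is not taken from the source: the names `ChiSquareDensity`, `dchisq`, and `gammaHalf` are assumptions, and it handles only integer k so that Γ(k/2) can be built from Γ(1/2) = sqrt(π) and Γ(1) = 1 with the recurrence Γ(z + 1) = z Γ(z).

```java
public class ChiSquareDensity {
    // Gamma(k/2) for a positive integer k, using Gamma(z + 1) = z * Gamma(z)
    // starting from Gamma(1/2) = sqrt(pi) (odd k) or Gamma(1) = 1 (even k).
    // Assumption: k is small enough that the product does not overflow a double.
    static double gammaHalf(int k) {
        double z = (k % 2 == 0) ? 1.0 : 0.5;
        double g = (k % 2 == 0) ? 1.0 : Math.sqrt(Math.PI);
        while (z < k / 2.0 - 1e-9) {
            g *= z;
            z += 1.0;
        }
        return g;
    }

    // Chi-square density with k degrees of freedom, evaluated at x.
    // Returns 0 for x <= 0, since the formula is used only for positive x here.
    static double dchisq(double x, int k) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be a positive integer");
        }
        if (x <= 0) {
            return 0.0;
        }
        return Math.pow(x, k / 2.0 - 1.0) * Math.exp(-x / 2.0)
                / (Math.pow(2.0, k / 2.0) * gammaHalf(k));
    }

    public static void main(String[] args) {
        System.out.println(dchisq(1.0, 1)); // exp(-1/2) / sqrt(2*pi), about 0.2420
        System.out.println(dchisq(2.0, 2)); // exp(-1) / 2, about 0.1839
    }
}
```

For large k, a log-gamma based evaluation would be a safer choice than the direct product used in this sketch.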
