the “no free lunch” theorem) than to work with partial knowledge about the problem,
knowledge that allows us to design algorithms with specific characteristics which
can make them more suitable for solving the problem.
2.2.1 Conditions for the Safe Use of Parametric Tests
In [24] the distinction between parametric and non-parametric tests is based on the
level of measurement represented by the data to be analyzed. That is, a parametric
test usually operates on data composed of real values.
However, having this type of data does not always mean that a parametric test
should be used. Other initial assumptions must be fulfilled for the safe use of
parametric tests; the non-fulfillment of these conditions might cause a statistical
analysis to lose credibility.
The following conditions are needed in order to safely carry out parametric tests
[24, 32]:
Independence: In statistics, two events are independent when the fact that one
occurs does not modify the probability of the other one occurring.
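In formal terms, this is the standard product rule for independent events:

$$P(A \cap B) = P(A)\,P(B), \qquad \text{equivalently} \qquad P(A \mid B) = P(A).$$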
Normality: An observation is normal when its behaviour follows a normal or
Gaussian distribution with a certain mean μ and variance σ². A normality
test applied over a sample can indicate the presence or absence of this condition
in observed data. Three normality tests are usually used to check whether
normality is present or not (a usage sketch with SciPy follows the list):
- Kolmogorov-Smirnov: compares the cumulative distribution of the observed data
with the cumulative distribution expected from a Gaussian distribution, obtaining
the p-value from the discrepancy between them.
- Shapiro-Wilk: analyzes the observed data to compute its level of symmetry and
kurtosis (shape of the curve), then measures the difference with respect to
a Gaussian distribution, obtaining the p-value from the sum of the squares of
the discrepancies.
- D'Agostino-Pearson: first computes the skewness and kurtosis to quantify how
far the distribution is from Gaussian in terms of asymmetry and shape. It then
calculates how far each of these values differs from the value expected under
a Gaussian distribution, and computes a single p-value from the sum of the
discrepancies.
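As an illustration, all three tests are available in SciPy. The following sketch (the sample data and the 0.05 significance level are illustrative assumptions, not taken from the text) applies them to a single sample of results:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.8, scale=0.05, size=30)  # e.g. 30 accuracy results

# Kolmogorov-Smirnov against a Gaussian fitted to the sample.
# Note: estimating the mean and std from the same sample makes this p-value
# optimistic; Lilliefors' correction is the usual refinement.
ks = stats.kstest(sample, "norm", args=(sample.mean(), sample.std(ddof=1)))

# Shapiro-Wilk
sw = stats.shapiro(sample)

# D'Agostino-Pearson (scipy.stats.normaltest combines skewness and kurtosis)
dp = stats.normaltest(sample)

for name, res in [("Kolmogorov-Smirnov", ks), ("Shapiro-Wilk", sw),
                  ("D'Agostino-Pearson", dp)]:
    verdict = "normality not rejected" if res.pvalue > 0.05 else "normality rejected"
    print(f"{name}: p = {res.pvalue:.4f} -> {verdict}")
```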
Heteroscedasticity: This property indicates a violation of the hypothesis of
equality of variances. Levene's test is used to check whether or not k samples
present this homogeneity of variances (homoscedasticity). When the observed data
do not fulfill the normality condition, its result is more reliable than that of
Bartlett's test [32], which checks the same property.
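Both checks can be sketched with SciPy as follows (the three synthetic samples, one with a deliberately larger variance, are assumptions for illustration only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(0.80, 0.05, 30)
b = rng.normal(0.78, 0.05, 30)
c = rng.normal(0.82, 0.10, 30)  # deliberately larger variance

# Levene's test: robust when the samples depart from normality
lev = stats.levene(a, b, c)

# Bartlett's test: checks the same hypothesis but assumes normality
bar = stats.bartlett(a, b, c)

print(f"Levene:   p = {lev.pvalue:.4f}")
print(f"Bartlett: p = {bar.pvalue:.4f}")
# A small p-value rejects homoscedasticity (equality of variances).
```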
With respect to the independence condition, Demšar suggests in [5] that indepen-
dence is not truly verified in k-FCV and 5×2 CV (a portion of samples is used
either for training or for testing in different partitions).
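A minimal sketch, assuming scikit-learn's KFold (a library choice not made in the original text), makes this lack of independence explicit: every pair of training partitions in k-FCV shares most of its samples.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20)  # indices standing in for a dataset of 20 samples
folds = [set(X[train]) for train, _ in KFold(n_splits=5).split(X)]

# In 5-FCV each training partition holds 80% of the data, so any two
# of them overlap on 75% of their samples: the folds are not independent.
for i in range(len(folds)):
    for j in range(i + 1, len(folds)):
        shared = len(folds[i] & folds[j]) / len(folds[i])
        print(f"training partitions {i} and {j} share {shared:.0%} of their samples")
```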
 