Data Sets and Proper Statistical Analysis of Data Mining Techniques - Data Preprocessing in Data Mining - page 26


for training and testing in different partitions). Hold-out partitions can safely be taken as independent, since training and test partitions do not overlap.

The independence of the events in terms of getting results is usually obvious, given that they are independent runs of the algorithm with randomly generated initial seeds. In the following, we show a normality analysis using the Kolmogorov-Smirnov, Shapiro-Wilk and D'Agostino-Pearson tests, together with a heteroscedasticity analysis using Levene's test, in order to show the reader how to check these properties.
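As a rough illustration of how these four tests could be run in practice, the sketch below uses SciPy, which is an assumption of this example and not a tool named in the text. The function name `check_parametric_conditions` and the synthetic samples are hypothetical; they are not the results discussed in this section. Note that the Kolmogorov-Smirnov p-value is only approximate here, because the mean and standard deviation are estimated from the same sample.

```python
import numpy as np
from scipy import stats


def check_parametric_conditions(samples):
    """Return per-sample normality p-values and one Levene p-value.

    `samples` maps a label (e.g. a data set name) to a 1-D sequence of results.
    """
    arrays = {name: np.asarray(values, dtype=float) for name, values in samples.items()}
    report = {}
    for name, x in arrays.items():
        # Kolmogorov-Smirnov against a normal with the sample's own mean and std.
        # Estimating the parameters from the data makes this p-value approximate.
        ks_p = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))[1]
        sw_p = stats.shapiro(x)[1]        # Shapiro-Wilk
        dp_p = stats.normaltest(x)[1]     # D'Agostino-Pearson omnibus test
        report[name] = {"KS": ks_p, "SW": sw_p, "DP": dp_p}
    # Levene's test checks whether all groups share the same variance.
    levene_p = stats.levene(*arrays.values())[1]
    return report, levene_p


# Synthetic data for illustration only: 50 results per hypothetical data set.
rng = np.random.default_rng(0)
demo = {
    "roughly_normal": rng.normal(0.80, 0.05, size=50),
    "skewed": rng.exponential(0.10, size=50),
}
report, levene_p = check_parametric_conditions(demo)
for name, pvalues in report.items():
    print(name, {test: round(p, 3) for test, p in pvalues.items()})
print("Levene p-value:", round(levene_p, 3))
```

A small p-value in any of the normality tests suggests rejecting normality for that sample, and a small Levene p-value suggests that the variances differ across groups.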

2.2.2 Normality Test over the Group of Data Sets and Algorithms

Let us consider a small case study involving a stochastic algorithm that needs a seed to generate its model. A classic example of this type of algorithm is the MLP. Using a small set of 6 well-known classification problems, we aim to analyze whether the conditions required to safely perform a parametric statistical analysis hold. We have used a 10-FCV validation scheme in which the MLP is run 5 times per fold, thus obtaining 50 results per data set. Please note that using k-FCV means that independence does not hold, but it is the most common validation scheme used in classification, so this case study remains relevant.
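The experimental design described above (10 folds, 5 seeded runs per fold, 50 results per data set) can be sketched as follows. This is a minimal outline using only the Python standard library; `k_fold_indices`, `run_experiment` and `placeholder_score` are hypothetical names, and `placeholder_score` merely stands in for training an MLP and measuring its test accuracy.

```python
import random


def k_fold_indices(n_samples, k, seed):
    """Shuffle the indices once and split them into k near-equal, disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def run_experiment(n_samples, train_and_score, k=10, runs_per_fold=5, partition_seed=0):
    """Collect k * runs_per_fold results for one data set."""
    folds = k_fold_indices(n_samples, k, partition_seed)
    results = []
    for fold_id, test_idx in enumerate(folds):
        held_out = set(test_idx)
        train_idx = [i for i in range(n_samples) if i not in held_out]
        for run in range(runs_per_fold):
            # Each run on the same fold gets its own seed for the stochastic model.
            model_seed = fold_id * runs_per_fold + run
            results.append(train_and_score(train_idx, test_idx, model_seed))
    return results


def placeholder_score(train_idx, test_idx, seed):
    """Hypothetical stand-in for fitting an MLP and returning its test accuracy."""
    return random.Random(seed).uniform(0.7, 0.9)


results = run_experiment(n_samples=150, train_and_score=placeholder_score)
print(len(results))  # 50 results: 10 folds x 5 runs
```

Because the training sets of different folds share most of their instances, the 50 results are not independent, which is the caveat raised in the paragraph above.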

First of all, we want to check whether our samples follow a normal distribution. Table 2.2 shows the p-values obtained for the normality tests described in the previous section. As we can observe, in many cases the normality assumption does not hold (indicated by an “a” in the table).

In addition to this general study, we show the sample distribution in three cases, in order to illustrate representative situations in which the normality tests obtain different results.

Figures 2.4 to 2.6 show different examples of histograms and Q-Q graphics. A histogram represents a statistical variable by using bars, so that the area of each bar is proportional to the frequency of the represented values. A Q-Q graphic plots the quantiles of the observed data against those of the normal distribution.
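To make the two graphical representations concrete, the following sketch computes the numbers behind them with the Python standard library only. The helper names `qq_points` and `histogram_counts` are hypothetical, the plotting positions (i - 0.5) / n are just one common convention, and the sample is synthetic.

```python
import random
from statistics import NormalDist, mean, stdev


def qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q graphic."""
    ordered = sorted(sample)
    n = len(ordered)
    # Normal distribution fitted with the sample mean and standard deviation.
    fitted = NormalDist(mean(ordered), stdev(ordered))
    return [
        (fitted.inv_cdf((i - 0.5) / n), value)
        for i, value in enumerate(ordered, start=1)
    ]


def histogram_counts(sample, bins=10):
    """Count how many values fall into each of `bins` equal-width intervals."""
    low, high = min(sample), max(sample)
    width = (high - low) / bins or 1.0  # guard against a constant sample
    counts = [0] * bins
    for value in sample:
        # The maximum value is assigned to the last bin.
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts


# Synthetic sample for illustration: 50 values drawn from a normal distribution.
generator = random.Random(1)
sample = [generator.gauss(0.8, 0.05) for _ in range(50)]

print(histogram_counts(sample, bins=8))
for theoretical, observed in qq_points(sample)[:3]:
    print(round(theoretical, 3), round(observed, 3))
```

With equal-width bins the bar height and the bar area are both proportional to the frequency. In the Q-Q pairs, a sample that is close to normal yields points lying near the diagonal, while systematic curvature away from it indicates a departure from normality.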

In Fig. 2.4 we can observe a general case in which non-normality is clearly present. On the contrary, Fig. 2.5 illustrates a sample whose distribution follows a normal shape, and the three normality tests employed verified

Table 2.2 Normality test applied to a sample case

                     Cleveland   Glass    Iris     Pima   Wine     Wisconsin
Kolmogorov-Smirnov   0.09        0.00 a   0.00 a   0.20   0.00 a   0.09 a
Shapiro-Wilk         0.04        0.00 a   0.00 a   0.80   0.00 a   0.02 a
D'Agostino-Pearson   0.08        0.01 a   0.02 a   0.51   0.00 a   0.27

a indicates that the normality is not satisfied
