Data Sets and Proper Statistical Analysis of Data Mining Techniques - Data Preprocessing in Data Mining

Chapter 2

Data Sets and Proper Statistical Analysis of Data Mining Techniques

Abstract Presenting a Data Mining technique and analyzing it often involves using a data set related to the domain. Fortunately, many well-known data sets are available in research and are widely used to check the performance of the technique being considered. Many of the subsequent sections of this topic include a practical experimental comparison of the techniques described in each one as an illustration of this process. Such comparisons require a clear test bed so that the reader can replicate and understand the analysis and the conclusions obtained. First, in Sect. 2.1, we describe the data sets used in the rest of the topic to study the representative algorithms of each section, indicating their characteristics, sources and availability. We also delve into the partitioning procedure and how it is expected to alleviate the problems associated with the validation of any supervised method, as well as the details of the performance measures that will be used in the rest of the topic. Section 2.2 takes a tour of the most common statistical techniques required in the literature to provide meaningful and correct conclusions. The steps followed to correctly use and interpret the statistical test outcome are also given.

2.1 Data Sets and Partitions

The ultimate goal of any DM process is to be applied to real-life problems. As testing a technique on every problem is infeasible, the common procedure is to evaluate it on a set of publicly available standard DM problems (or data sets). In this topic we will mainly use the KEEL DM tool, which is supported by the KEEL-Dataset repository¹, where data sets from different well-known sources such as UCI [2] and others have been converted to KEEL ARFF format and partitioned. This enables the user to replicate all the experiments presented in this topic with ease.
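
To make the idea of a partitioned data set concrete, the following is a minimal sketch of how a fixed set of train/test partitions could be generated. It assumes plain k-fold cross-validation with a fixed random seed; the passage above does not state which scheme the repository uses, and the function name k_fold_partitions and its parameters are hypothetical, chosen only for illustration.

```python
import random


def k_fold_partitions(n_examples, k=10, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation.

    Assumption: plain (non-stratified) k-fold with a seeded shuffle, so the
    same partitions are produced on every run.
    """
    if not 2 <= k <= n_examples:
        raise ValueError("k must be between 2 and n_examples")

    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)

    # Fold i takes every k-th shuffled index starting at position i,
    # so fold sizes differ by at most one example.
    folds = [indices[i::k] for i in range(k)]

    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


if __name__ == "__main__":
    # Each example appears in exactly one test fold across the k partitions.
    for train_idx, test_idx in k_fold_partitions(23, k=5):
        print(len(train_idx), len(test_idx))
```

Fixing the seed is what allows different readers to obtain identical partitions, which is the same purpose served by distributing the data sets already partitioned.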

As this topic focuses on supervised learning, we will provide a list of the data sets that fall within this paradigm. The representative data sets that will be used in classification are shown in Table 2.1. The table includes the most relevant information about each data set:

¹ http://keel.es/datasets.php
