Chapter 2
Data Sets and Proper Statistical Analysis
of Data Mining Techniques
Abstract Presenting a Data Mining technique and analyzing it often involves using
a data set related to the domain. Fortunately, in research many well-known data sets
are available and widely used to check the performance of the technique being
considered. Many of the subsequent sections of this topic include a practical
experimental comparison of the techniques described in each one as an exemplification
of this process. Such comparisons require a clear test bed so that the reader can
replicate and understand the analysis and the conclusions obtained. First, in
Sect. 2.1, we provide an insight into the data sets used to study the algorithms
presented as representative in each section. In this section we elaborate on the data
sets used in the rest of the topic, indicating their characteristics, sources and
availability. We also delve into the partitioning procedure and how it is expected to
alleviate the problems associated with the validation of any supervised method, as
well as the details of the performance measures that will be used in the rest of the
topic. Section 2.2 takes a tour of the most common statistical techniques required in
the literature to provide meaningful and correct conclusions. The steps followed to
correctly use and interpret the statistical test outcome are also given.
2.1 Data Sets and Partitions
The ultimate goal of any DM process is to be applied to real-life problems. As testing
a technique on every problem is infeasible, the common procedure is to evaluate such
a technique on a set of publicly available standard DM problems (or data sets). In this
topic we will mainly use the KEEL DM tool, which is also supported by the
KEEL-Dataset repository 1, where data sets from different well-known sources such as
UCI [2] and others have been converted to KEEL ARFF format and partitioned. This
enables the user to replicate all the experiments presented in this topic with ease.
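To make the idea of a pre-partitioned data set concrete, the following is a minimal sketch of a stratified k-fold split in plain Python. It is an illustration only: the excerpt above does not state which partitioning scheme the repository uses, so the choice of k-fold cross-validation, the default k = 10, the fixed seed, and the function names stratified_k_fold and train_test_pairs are all assumptions introduced here, not part of KEEL.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Split example indices into k folds with roughly equal class proportions.

    Hypothetical helper for illustration; not a KEEL function.
    """
    rng = random.Random(seed)  # fixed seed so the partition is reproducible

    # Group example indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0  # continue the round-robin across classes to balance fold sizes
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for j, idx in enumerate(indices):
            folds[(offset + j) % k].append(idx)
        offset += len(indices)
    return folds


def train_test_pairs(folds):
    """Yield (train, test) index lists, using each fold once as the test set."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(train), sorted(test)


if __name__ == "__main__":
    labels = ["a"] * 6 + ["b"] * 4
    for train, test in train_test_pairs(stratified_k_fold(labels, k=5)):
        print("train:", train, "test:", test)
```

Storing the folds once, with a fixed seed, is what lets different readers evaluate a technique on exactly the same training and test examples.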
As this topic focuses on supervised learning, we will provide a list of the data
sets that belong to this paradigm. The representative data sets that will be used in
classification are shown in Table 2.1. The table includes the most relevant information
about the data set:
1 http://keel.es/datasets.php
 