Table 2.1 (continued)

Acronym  Data set         #Attributes (R/I/N)  #Examples  #Classes  Miss val.
SAT      Satimage         36 (0/36/0)          6435       7         No
SEG      Segment          19 (19/0/0)          2310       7         No
SON      Sonar            60 (60/0/0)          208        2         No
SPO      Sponge           45 (0/3/42)          76         12        Yes
SPA      Spambase         57 (57/0/0)          4597       2         No
SPH      Spectfheart      44 (0/44/0)          267        2         No
TAE      Tae              5 (0/5/0)            151        3         No
TNC      Titanic          3 (3/0/0)            2201       2         No
VEH      Vehicle          18 (0/18/0)          846        4         No
VOW      Vowel            13 (10/3/0)          990        11        No
WAT      Water treatment  38 (38/0/0)          526        13        Yes
WIN      Wine             13 (13/0/0)          178        3         No
WIS      Wisconsin        9 (0/9/0)            699        2         No
YEA      Yeast            8 (8/0/0)            1484       10        No
Acronym and Data set give the abbreviation that will be used as future reference and the name of the data set.
#Attributes is the number of attributes/features and their type: R stands for real-valued attributes, I for integer attributes and N for nominal attributes.
#Examples is the number of examples/instances contained in the data set.
#Classes is the number of different class labels the problem presents.
Miss val. indicates whether the data set contains MVs (missing values) or not.
2.1.1 Data Set Partitioning
The benchmark data sets presented are used with one goal: to evaluate the performance of a given model over a set of well-known standard problems, so that the results can be replicated by other users and compared against new proposals. However, the data must be used correctly in order to avoid bias in the results.
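To make the idea of correct data usage concrete, the following minimal Python sketch shows a simple hold-out partition: the examples are shuffled and divided into disjoint training and test subsets, so the model is built on one portion and validated on data it has never seen. The function name holdout_split, the 30% test ratio and the fixed seed are illustrative choices, not taken from the text.

import numpy as np

def holdout_split(X, y, test_ratio=0.3, seed=42):
    """Randomly partition (X, y) into disjoint training and test subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))              # shuffle example indices
    n_test = int(round(len(X) * test_ratio))   # size of the test subset
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Example usage (model is any learner with fit/score methods):
# X_tr, y_tr, X_te, y_te = holdout_split(X, y)
# model.fit(X_tr, y_tr)
# accuracy = model.score(X_te, y_te)   # estimated on unseen examples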
If the whole data set is used both to build and to validate the model generated by an ML algorithm, we have no indication of how the model will behave with new, unseen cases. Two main problems may arise from using the same data to train and to evaluate the model:
Underfitting is the easiest problem to understand. It occurs when the model is poorly adjusted to the data, suffering from a high error on both the training and the test (unseen) data.
 
 