5 Empirical Scalability Study
The empirical study comprises size-up and speed-up experiments on several
benchmark datasets. Size-up experiments examine the algorithm's performance
(runtime) with respect to the size of the training data, and speed-up experiments
examine the algorithm's performance with respect to the number of computing
nodes used, using the speed-up factors highlighted in the previous section. For
the experiments we used two synthetic datasets from the infobiotics data
repository [2]. We chose these datasets because they can still be processed on a
single computing node in our cluster, which serves as a reference point. The
datasets are outlined in Table 1. The Hadoop cluster is hosted on 10 identical
off-the-shelf workstations, each with 1 GB of memory, a 2.8 GHz CPU and the
XUbuntu operating system. The Hadoop version installed on the cluster was
0.20.203.0rc1. All experiments in this section measure the total runtime from
loading the data onto the cluster to aggregating the results at the Reducer.
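To make the speed-up measurements concrete, the short Python sketch below computes speed-up factors from measured total runtimes, assuming the usual definition S(p) = T(1)/T(p), i.e. the single-node runtime divided by the runtime on p nodes. The function name and the runtime figures are purely illustrative and are not taken from the paper's results.

def speedup_factors(runtimes_by_nodes):
    """Compute speed-up factors S(p) = T(1) / T(p) from total runtimes.

    runtimes_by_nodes maps the number of cluster nodes p to the measured
    total runtime T(p) in seconds; T(1) (single node) is the reference.
    """
    t1 = runtimes_by_nodes[1]  # single-node reference runtime
    return {p: t1 / t for p, t in sorted(runtimes_by_nodes.items())}

# Example with invented runtimes (seconds), for illustration only;
# an ideal, linear speed-up would yield S(p) = p.
measured = {1: 1200.0, 2: 640.0, 4: 340.0, 10: 160.0}
print(speedup_factors(measured))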
Table 1. Datasets used for evaluation. Attributes hold double values and class values
are represented by a single character.

Test data   Number of data instances   Number of attributes   Number of classes
1           50000                      5                      5
2           30000                      3                      2
Again, size-up experiments examine the performance of Parallel Random
Prism on a fixed number of cluster nodes with an increasing workload (training
data size). In general, a linear increase in runtime with respect to the training
data size is desired. We produced larger versions of the two datasets in Table 1
by appending the data to itself in the vertical direction (multiplying instances)
and the horizontal direction (multiplying attributes). Note that this appending
of data does not introduce new concepts and hence does not influence the
rulesets produced. This is important, as altered rulesets may result in different
runtimes of the system, and hence the size-up comparison would not be reliable.
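As a minimal sketch of this scaling procedure (not the authors' tooling), the following Python code appends a dataset to itself vertically (multiplying instances) and horizontally (multiplying attributes). The file name, CSV format and helper name are assumptions made for illustration.

import csv

def scale_dataset(rows, vertical=1, horizontal=1):
    """Return the dataset enlarged by appending it to itself.

    rows       -- list of records, each a list of attribute values plus a class label
    vertical   -- number of copies of the instances to stack (multiplies instances)
    horizontal -- number of copies of the attributes to concatenate (multiplies attributes)
    """
    # Repeat the attribute part of each record, keeping a single class label.
    widened = [row[:-1] * horizontal + [row[-1]] for row in rows]
    # Repeat the instances.
    return widened * vertical

with open("test_data_1.csv") as f:  # hypothetical input file
    data = [row for row in csv.reader(f)]

doubled_instances = scale_dataset(data, vertical=2)     # 2x the number of instances
doubled_attributes = scale_dataset(data, horizontal=2)  # 2x the number of attributes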
The reasoning behind this way of increasing the data size is that it does not
change the concept encoded in the data. Simply taking different-sized samples
from the original training data would influence the concept and thus the runtime
needed to find rules describing it. Appending the data to itself therefore allows
Parallel Random Prism's runtime to be examined more precisely. The calculation
of the weights of the individual R-PrismTCS classifiers might be influenced by
this way of building different-sized samples, as some instances may appear in
both the training and the test set. However, this is not relevant for these
experiments, as this evaluation examines the computational performance and
not the classification accuracy. For all experiments we used 100 R-PrismTCS
base classifiers.