5 Empirical Scalability Study
The empirical study comprises size-up and speed-up experiments on several
benchmark datasets. Size-up experiments examine the algorithm's performance
(runtime) with respect to the size of the training data, and speed-up experiments
examine the algorithm's performance with respect to the number of computing
nodes used, using the speed-up factors highlighted in the previous section. For
the experiments we used two synthetic datasets from the infobiotics data
repository [2]. We chose these datasets because they can still be processed on a
single computing node in our cluster, which serves as a reference point. The
datasets are outlined in Table 1. The Hadoop cluster is hosted on 10 identical
off-the-shelf workstations, each with 1 GB of memory, a 2.8 GHz CPU and the
XUbuntu operating system. The Hadoop version installed on the cluster was
0.20.203.0rc1. All experiments in this section measure the total runtime from
loading the data onto the cluster to aggregating the results at the Reducer.
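To make the speed-up measurements concrete, the short Python sketch below computes speed-up factors from measured total runtimes, assuming the usual definition S(p) = T(1)/T(p), i.e. the single-node runtime divided by the runtime on p nodes. The function name and the runtime figures are purely illustrative and are not taken from the paper's results.

def speedup_factors(runtimes_by_nodes):
    """Compute speed-up factors S(p) = T(1) / T(p) from total runtimes.

    runtimes_by_nodes maps the number of cluster nodes p to the measured
    total runtime T(p) in seconds; T(1) (single node) is the reference.
    """
    t1 = runtimes_by_nodes[1]  # single-node reference runtime
    return {p: t1 / t for p, t in sorted(runtimes_by_nodes.items())}

# Example with invented runtimes (seconds), for illustration only;
# an ideal, linear speed-up would yield S(p) = p.
measured = {1: 1200.0, 2: 640.0, 4: 340.0, 10: 160.0}
print(speedup_factors(measured))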
Table 1. Datasets used for evaluation. Attributes hold double values and class values
are represented by a single character.

Test data   Number of data instances   Number of attributes   Number of classes
1           50000                      5                      5
2           30000                      3                      2
Again, size-up experiments examine the performance of Parallel Random
Prism on a fixed number of cluster nodes with an increasing workload (training
data size). In general, a linear increase in runtime with respect to the training
data size is desired. We produced larger versions of the two datasets in Table 1
by appending the data to itself in the vertical direction (multiplying instances)
and the horizontal direction (multiplying attributes). Note that this appending
of data does not introduce new concepts and hence does not influence the
rulesets produced. This is important, as altered rulesets may result in different
runtimes of the system, and hence the size-up comparison would not be reliable.
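As a minimal sketch of this scaling procedure (not the authors' tooling), the following Python code appends a dataset to itself vertically (multiplying instances) and horizontally (multiplying attributes). The file name, CSV format and helper name are assumptions made for illustration.

import csv

def scale_dataset(rows, vertical=1, horizontal=1):
    """Return the dataset enlarged by appending it to itself.

    rows       -- list of records, each a list of attribute values plus a class label
    vertical   -- number of copies of the instances to stack (multiplies instances)
    horizontal -- number of copies of the attributes to concatenate (multiplies attributes)
    """
    # Repeat the attribute part of each record, keeping a single class label.
    widened = [row[:-1] * horizontal + [row[-1]] for row in rows]
    # Repeat the instances.
    return widened * vertical

with open("test_data_1.csv") as f:  # hypothetical input file
    data = [row for row in csv.reader(f)]

doubled_instances = scale_dataset(data, vertical=2)     # 2x the number of instances
doubled_attributes = scale_dataset(data, horizontal=2)  # 2x the number of attributes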
The reasoning behind this way of increasing the data size is that it does not
change the concept encoded in the data. Simply taking different-sized samples
from the original training data would influence the concept and thus the runtime
needed to find rules describing it. Appending the data to itself therefore allows
Parallel Random Prism's runtime to be examined more precisely. The calculation
of the weights of the individual R-PrismTCS classifiers might be influenced by
this way of building different-sized samples, as some instances may appear in
both the training and the test set. However, this is not relevant for these
experiments, as this evaluation examines the computational performance and
not the classification accuracy. For all experiments we used 100 R-PrismTCS
base classifiers.