Bagging Procedure. A bootstrap sample needs to be built from the training data for each R-PrismTCS classifier in order to create samples that are as diverse as possible (as required by Random Prism).
Thus bagging imposes a considerable computational overhead, which also needs to be addressed. In the proposed Parallel Random Prism classifier implementation, multiple bagging procedures are executed concurrently. This is realised by integrating the bagging procedure into the Mapper that implements R-PrismTCS, so that p bagging procedures run concurrently, where p is the number of available computing nodes in the cluster. The original training data is distributed to each computing node in the Hadoop cluster at the beginning of Parallel Random Prism's execution. We have not influenced how Hadoop distributes the data; however, Hadoop typically distributes chunks and redundant copies of the training data across the cluster. This partitioning and redundancy keeps the communication overhead low and provides robustness in case a cluster node fails: the original training data only needs to be communicated once, as the local Mappers on a computing node only need the local copy of the training data in order to build their individual samples.
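To make this concrete, the following is a minimal sketch of a Mapper with an integrated bagging procedure, written against the Hadoop MapReduce Java API. The class RPrismTCSMapper, the placeholder class RPrismTCS and its methods induceRules and accuracy are illustrative assumptions standing in for the R-PrismTCS induction of Algorithm 1, and the out-of-bag instances are assumed to serve as the validation set; this is a sketch, not the authors' actual implementation.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative only: RPrismTCS is a hypothetical placeholder class for Algorithm 1.
public class RPrismTCSMapper extends Mapper<Object, Text, IntWritable, Text> {

    private final List<String> instances = new ArrayList<>();

    @Override
    protected void map(Object key, Text value, Context context) {
        // Buffer the locally available training instances (one per input line).
        instances.add(value.toString());
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        int n = instances.size();
        Random rng = new Random();

        // Bagging: draw n instances with replacement; instances never drawn
        // (out-of-bag) are assumed here to form the validation set.
        List<String> bag = new ArrayList<>(n);
        boolean[] drawn = new boolean[n];
        for (int i = 0; i < n; i++) {
            int idx = rng.nextInt(n);
            bag.add(instances.get(idx));
            drawn[idx] = true;
        }
        List<String> validation = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            if (!drawn[i]) {
                validation.add(instances.get(i));
            }
        }

        // Induce the R-PrismTCS ruleset on the bootstrap sample and weight it
        // by its classification accuracy on the validation set (placeholders).
        RPrismTCS classifier = new RPrismTCS();
        String ruleSet = classifier.induceRules(bag);
        double weight = classifier.accuracy(validation);

        // Emit the serialised ruleset with its weight; a constant key routes
        // every classifier to the single aggregating Reducer.
        context.write(new IntWritable(0), new Text(weight + "\t" + ruleSet));
    }
}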
Building of Composite Classifier. The aggregation of the individual
R-PrismTCS classifiers and their associated weights is implemented in a single
Reducer. Once the individual R-PrismTCS Mappers finish the induction of their
rulesets, they send the rulesets and their associated weights to the Reducer. The
Reducer simply holds a collection of the classifiers together with their weights. If a new unlabelled data instance is presented, the Reducer applies weighted majority voting over all classifiers, or over a subset of the best classifiers (according to their weights), in order to label the new data instance. The data transmitted from the Mappers to the Reducer is relatively small in size, comprising the rules of the induced R-PrismTCS base classifiers; nevertheless, we have incorporated this communication in our analysis in Sect. 4. However, if the number of R-PrismTCS classifiers increases, one may consider distributing the computational and communication overhead (associated with the aggregation of the classifiers) over several Reducers executed on different computing nodes.
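The aggregation step could be sketched as follows, under the assumption that each Mapper emits a tab-separated (weight, ruleset) pair under a constant key, as in the Mapper sketch above; the class name EnsembleReducer is hypothetical and the code is not the authors' actual Reducer.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class EnsembleReducer extends Reducer<IntWritable, Text, IntWritable, Text> {

    @Override
    protected void reduce(IntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<Double> weights = new ArrayList<>();
        List<String> ruleSets = new ArrayList<>();

        // Collect the (weight, ruleset) pairs received from the R-PrismTCS Mappers.
        for (Text value : values) {
            String[] parts = value.toString().split("\t", 2);
            weights.add(Double.parseDouble(parts[0]));
            ruleSets.add(parts[1]);
        }

        // Persist the composite classifier: at prediction time each ruleset votes
        // for a class label and the votes are summed per label, weighted by the
        // classifier's validation weight (weighted majority voting), optionally
        // restricted to the best-weighted subset of classifiers.
        for (int i = 0; i < ruleSets.size(); i++) {
            context.write(key, new Text(weights.get(i) + "\t" + ruleSets.get(i)));
        }
    }
}

Keeping a single Reducer matches the architecture described above; distributing the aggregation over several Reducers, as suggested for larger ensembles, would only require changing the routing key and the job configuration.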
Parallel Random Prism Architecture. Figure 3 shows the principal archi-
tecture of Parallel Random Prism using four Mappers, one Reducer and three
cluster nodes.
The input data (training data) is sent to each computing node. A computing
node can execute multiple Mappers. Each Mapper implements the R-PrismTCS
base classifier outlined in Algorithm 1, creates a validation and a training set, and then produces a set of rules using the training data and a weight using the validation data. Each R-PrismTCS Mapper then sends its ruleset and the associated weight to the Reducer. The Reducer keeps a collection of the received classifiers and their weights and applies weighted majority voting over all of them, or over a subset of the best classifiers, to new unlabelled data instances.
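For completeness, a driver that wires such a Mapper and Reducer into a single Hadoop job with one aggregating Reducer, as in the architecture above, might look like the following sketch (class names and paths are illustrative assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ParallelRandomPrismDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "Parallel Random Prism");
        job.setJarByClass(ParallelRandomPrismDriver.class);

        job.setMapperClass(RPrismTCSMapper.class);   // bagging + rule induction
        job.setReducerClass(EnsembleReducer.class);  // classifier aggregation
        job.setNumReduceTasks(1);                    // one aggregating Reducer

        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // training data
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // composite classifier
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}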