Bagging Procedure. A bootstrap sample needs to be built from the training data for each R-PrismTCS classifier in order to create samples that are as diverse as possible (as required by Random Prism).
Thus bagging imposes a considerable computational overhead, which also needs to be addressed. In the proposed Parallel Random Prism classifier implementation, multiple bagging procedures are executed concurrently. This is realised by integrating the bagging procedure into the Mapper that implements R-PrismTCS, so that p bagging procedures run concurrently, where p is the number of available computing nodes in the cluster. The original training data is distributed to each computing node in the Hadoop cluster at the beginning of Parallel Random Prism's execution. We have not influenced how Hadoop distributes the data; however, Hadoop typically distributes chunks and redundant copies of the training data across the cluster. This partitioning and redundancy keeps the communication overhead low and provides robustness in case a cluster node fails: the original training data only needs to be communicated once, as the local Mappers on a computing node only need the local copy of the training data in order to build their individual samples.
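To make this concrete, the following is a minimal sketch of a Mapper with an integrated bagging procedure, written against the Hadoop MapReduce Java API. The class RPrismTCSMapper, the placeholder class RPrismTCS and its methods induceRules and accuracy are illustrative assumptions standing in for the R-PrismTCS induction of Algorithm 1, and the out-of-bag instances are assumed to serve as the validation set; this is a sketch, not the authors' actual implementation.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative only: RPrismTCS is a hypothetical placeholder class for Algorithm 1.
public class RPrismTCSMapper extends Mapper<Object, Text, IntWritable, Text> {

    private final List<String> instances = new ArrayList<>();

    @Override
    protected void map(Object key, Text value, Context context) {
        // Buffer the locally available training instances (one per input line).
        instances.add(value.toString());
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        int n = instances.size();
        Random rng = new Random();

        // Bagging: draw n instances with replacement; instances never drawn
        // (out-of-bag) are assumed here to form the validation set.
        List<String> bag = new ArrayList<>(n);
        boolean[] drawn = new boolean[n];
        for (int i = 0; i < n; i++) {
            int idx = rng.nextInt(n);
            bag.add(instances.get(idx));
            drawn[idx] = true;
        }
        List<String> validation = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            if (!drawn[i]) {
                validation.add(instances.get(i));
            }
        }

        // Induce the R-PrismTCS ruleset on the bootstrap sample and weight it
        // by its classification accuracy on the validation set (placeholders).
        RPrismTCS classifier = new RPrismTCS();
        String ruleSet = classifier.induceRules(bag);
        double weight = classifier.accuracy(validation);

        // Emit the serialised ruleset with its weight; a constant key routes
        // every classifier to the single aggregating Reducer.
        context.write(new IntWritable(0), new Text(weight + "\t" + ruleSet));
    }
}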
Building of Composite Classifier. The aggregation of the individual
R-PrismTCS classifiers and their associated weights is implemented in a single
Reducer. Once the individual R-PrismTCS Mappers finish the induction of their
rulesets, they send the rulesets and their associated weights to the Reducer. The
Reducer simply holds a collection of the classifiers together with their weights. If a new unlabelled data instance is presented, the Reducer applies weighted majority voting over all classifiers, or over a subset of the best classifiers (according to their weights), in order to label the new data instance. The data transmitted from the Mappers to the Reducer is relatively small in size, comprising the rules of the induced R-PrismTCS base classifiers; nevertheless, we have incorporated this communication in our analysis in Sect. 4. However, if the number of R-PrismTCS classifiers increases, one may consider distributing the computational and communication overhead (associated with the aggregation of the classifiers) over several Reducers executed on different computing nodes.
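The aggregation step could be sketched as follows, under the assumption that each Mapper emits a tab-separated (weight, ruleset) pair under a constant key, as in the Mapper sketch above; the class name EnsembleReducer is hypothetical and the code is not the authors' actual Reducer.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class EnsembleReducer extends Reducer<IntWritable, Text, IntWritable, Text> {

    @Override
    protected void reduce(IntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<Double> weights = new ArrayList<>();
        List<String> ruleSets = new ArrayList<>();

        // Collect the (weight, ruleset) pairs received from the R-PrismTCS Mappers.
        for (Text value : values) {
            String[] parts = value.toString().split("\t", 2);
            weights.add(Double.parseDouble(parts[0]));
            ruleSets.add(parts[1]);
        }

        // Persist the composite classifier: at prediction time each ruleset votes
        // for a class label and the votes are summed per label, weighted by the
        // classifier's validation weight (weighted majority voting), optionally
        // restricted to the best-weighted subset of classifiers.
        for (int i = 0; i < ruleSets.size(); i++) {
            context.write(key, new Text(weights.get(i) + "\t" + ruleSets.get(i)));
        }
    }
}

Keeping a single Reducer matches the architecture described above; distributing the aggregation over several Reducers, as suggested for larger ensembles, would only require changing the routing key and the job configuration.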
Parallel Random Prism Architecture. Figure 3 shows the principal archi-
tecture of Parallel Random Prism using four Mappers, one Reducer and three
cluster nodes.
The input data (training data) is sent to each computing node. A computing
node can execute multiple Mappers. Each Mapper implements the R-PrismTCS
base classifier outlined in Algorithm 1, creates a validation and a training set, and then produces a set of rules using the training data and a weight using the validation data. Each R-PrismTCS Mapper then sends its ruleset and the associated weight to the Reducer. The Reducer keeps a collection of the received classifiers and their weights and applies weighted majority voting over all of them, or over a subset of the best classifiers, to new unlabelled data instances.
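For completeness, a driver that wires such a Mapper and Reducer into a single Hadoop job with one aggregating Reducer, as in the architecture above, might look like the following sketch (class names and paths are illustrative assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ParallelRandomPrismDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "Parallel Random Prism");
        job.setJarByClass(ParallelRandomPrismDriver.class);

        job.setMapperClass(RPrismTCSMapper.class);   // bagging + rule induction
        job.setReducerClass(EnsembleReducer.class);  // classifier aggregation
        job.setNumReduceTasks(1);                    // one aggregating Reducer

        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // training data
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // composite classifier
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}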