runtime. However, the low discrepancy observed after using 10 cluster nodes suggests that the impact of $\sum_{i=1}^{p} T_{asm,i}$ and $T_{comdat} \cdot p$ is not very high, and thus the experiments are far from using the maximum number of cluster nodes that
are still beneficial to lowering the runtime.
speed-up analysis in the previous section. Please note that the theoretical and
empirical analysis presented in this paper focuses on the algorithm rather than
the version of MapReduce being used. If the sample constructed for the R-PrismTCS classifier is bigger than the HDFS block size, additional communication overhead will be incurred and less speed-up can be achieved. The
samples constructed in the experiments outlined in this paper were not bigger
than the HDFS block size.
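To make this saturation argument concrete, a rough decomposition follows. This is an illustrative sketch, not the exact formula of the previous section: it assumes the runtime on p nodes splits into a training component (written here as T_{train}, a stand-in label) that shrinks with p, plus the overhead terms named above that grow with p.

\[
T(p) \;\approx\; \frac{T_{train}}{p} \;+\; \sum_{i=1}^{p} T_{asm,i} \;+\; T_{comdat} \cdot p,
\qquad
S(p) \;=\; \frac{T(1)}{T(p)}
\]

As long as the two overhead terms remain small compared with T_{train}/p, the speed-up S(p) stays close to p; once they dominate, adding further nodes no longer lowers the runtime. The low discrepancy observed at 10 nodes indicates that this regime had not yet been reached in the experiments.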
Loosely speaking, Parallel Random Prism indeed exhibits linear scalability
with respect to the number of training instances and the number of features.
Furthermore, the algorithm also shows a near linear speed-up factor.
The current implementation of Parallel Random Prism is bound in its max-
imum parallelism by the number of R-PrismTCS classifiers utilised. However,
R-PrismTCS classifiers could also be parallelised. The Parallel Modular Classification Rule Induction (PMCRI) framework [19], which parallelises, amongst others, the PrismTCS [5] classifier, could also be used to parallelise the R-PrismTCS classifier, owing to the similarity of the R-PrismTCS and PrismTCS classifiers. This is, however, outside the scope of this paper.
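To illustrate the point that the attainable parallelism of the current design is capped by the number of base classifiers, the following is a minimal, self-contained Python sketch rather than the paper's Hadoop implementation. The function train_base_classifier is a hypothetical stand-in for an R-PrismTCS learner, and the final majority vote only mimics the inexpensive serial combining step; none of these names come from the paper.

import random
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def bagged_sample(data, rng):
    # Sample with replacement to the size of the original training set (bagging).
    return [rng.choice(data) for _ in data]

def train_base_classifier(args):
    # Hypothetical stand-in for R-PrismTCS: it merely learns the majority class
    # of its bagged sample. A real base classifier would induce a rule set.
    data, seed = args
    rng = random.Random(seed)
    sample = bagged_sample(data, rng)
    return Counter(label for _, label in sample).most_common(1)[0][0]

def parallel_random_prism(data, n_classifiers=10, n_workers=4):
    # Train the base classifiers concurrently; the combining step stays serial.
    jobs = [(data, seed) for seed in range(n_classifiers)]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        models = list(pool.map(train_base_classifier, jobs))
    # Degenerate majority vote over the base "models"; a real ensemble would
    # vote per test instance using each base classifier's rule set.
    return Counter(models).most_common(1)[0][0]

if __name__ == "__main__":
    toy_data = [([0.1 * i], "A" if i % 3 else "B") for i in range(30)]
    print(parallel_random_prism(toy_data))

Regardless of how many workers the pool offers, at most n_classifiers tasks can run concurrently, which is exactly the parallelism bound discussed above.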
6 Conclusions
This paper presented work on a novel, well-scaling ensemble classifier called Par-
allel Random Prism. Ensemble classifiers exhibit a very high predictive accuracy
compared with standalone classifiers, especially in noisy domains. However, this
increase in performance comes at the expense of computational efficiency, due to data replication and the induction of multiple classifiers. Thus ensemble classifiers applied even to modestly sized training data already challenge the computational
hardware. Section 2 highlighted alternative base classifiers to decision trees (on
which most ensemble classifiers are based), in particular the Prism approach. The
PrismTCS standalone classifier often outperforms decision trees when applied to
noisy data, and hence is a good candidate base classifier for ensemble classifiers.
Section 2 proposed the Random Prism ensemble learner with the PrismTCS
based R-PrismTCS base classifier. It summarised results concerning classifica-
tion accuracy and gave an initial empirical estimate of Random Prism's runtime
requirements. Section 3 also highlighted a parallel version of Random Prism using
the Hadoop implementation of the MapReduce programming paradigm. Essen-
tially multiple R-PrismTCS base classifiers are executed concurrently on computing nodes in a Hadoop cluster. The only aspects of Random Prism that are
not parallelised are the inexpensive combining procedure of the individual classi-
fiers and the distribution of the original training data over the cluster. Section 4
gave a theoretical complexity analysis of Random Prism and a theoretical scala-
bility analysis of Parallel Random Prism. The parallel version of Random Prism