runtime. However, the low discrepancy observed after using 10 cluster nodes suggests that the impact of $\sum_{i=1}^{p} T_{asm,i}$ and $T_{comdat} \cdot p$ is not very high, and thus the experiments are far from using the maximum number of cluster nodes that
are still beneficial to lowering the runtime.
speed-up analysis in the previous section. Please note that the theoretical and
empirical analysis presented in this paper focuses on the algorithm rather than
the version of MapReduce being used. If the sample constructed for the R-PrismTCS classifier is bigger than the HDFS block size, additional communication overhead will be incurred and less speed-up can be achieved. The
samples constructed in the experiments outlined in this paper were not bigger
than the HDFS block size.
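To make this saturation argument concrete, a rough decomposition follows. This is an illustrative sketch, not the exact formula of the previous section: it assumes the runtime on p nodes splits into a training component (written here as T_{train}, a stand-in label) that shrinks with p, plus the overhead terms named above that grow with p.

\[
T(p) \;\approx\; \frac{T_{train}}{p} \;+\; \sum_{i=1}^{p} T_{asm,i} \;+\; T_{comdat} \cdot p,
\qquad
S(p) \;=\; \frac{T(1)}{T(p)}
\]

As long as the two overhead terms remain small compared with T_{train}/p, the speed-up S(p) stays close to p; once they dominate, adding further nodes no longer lowers the runtime. The low discrepancy observed at 10 nodes indicates that this regime had not yet been reached in the experiments.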
Loosely speaking, Parallel Random Prism indeed exhibits linear scalability
with respect to the number of training instances and the number of features.
Furthermore, the algorithm also shows a near linear speed-up factor.
The current implementation of Parallel Random Prism is bound in its max-
imum parallelism by the number of R-PrismTCS classifiers utilised. However,
R-PrismTCS classifiers could also be parallelised. The Parallel Modular Classification Rule Induction (PMCRI) framework [19], which parallelises, amongst others, the PrismTCS [5] classifier, could also be used to parallelise the R-PrismTCS classifier, owing to the similarity of the R-PrismTCS and PrismTCS classifiers. This is, however, outside the scope of this paper.
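To illustrate the point that the attainable parallelism of the current design is capped by the number of base classifiers, the following is a minimal, self-contained Python sketch rather than the paper's Hadoop implementation. The function train_base_classifier is a hypothetical stand-in for an R-PrismTCS learner, and the final majority vote only mimics the inexpensive serial combining step; none of these names come from the paper.

import random
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def bagged_sample(data, rng):
    # Sample with replacement to the size of the original training set (bagging).
    return [rng.choice(data) for _ in data]

def train_base_classifier(args):
    # Hypothetical stand-in for R-PrismTCS: it merely learns the majority class
    # of its bagged sample. A real base classifier would induce a rule set.
    data, seed = args
    rng = random.Random(seed)
    sample = bagged_sample(data, rng)
    return Counter(label for _, label in sample).most_common(1)[0][0]

def parallel_random_prism(data, n_classifiers=10, n_workers=4):
    # Train the base classifiers concurrently; the combining step stays serial.
    jobs = [(data, seed) for seed in range(n_classifiers)]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        models = list(pool.map(train_base_classifier, jobs))
    # Degenerate majority vote over the base "models"; a real ensemble would
    # vote per test instance using each base classifier's rule set.
    return Counter(models).most_common(1)[0][0]

if __name__ == "__main__":
    toy_data = [([0.1 * i], "A" if i % 3 else "B") for i in range(30)]
    print(parallel_random_prism(toy_data))

Regardless of how many workers the pool offers, at most n_classifiers tasks can run concurrently, which is exactly the parallelism bound discussed above.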
6 Conclusions
This paper presented work on a novel, well-scaling ensemble classifier called Par-
allel Random Prism. Ensemble classifiers exhibit a very high predictive accuracy
compared with standalone classifiers, especially in noisy domains. However, this
increase in performance comes at the expense of computational efficiency, due to data replication and the induction of multiple classifiers. Thus ensemble classifiers applied even to modestly sized training data already challenge the computational
hardware. Section 2 highlighted alternative base classifiers to decision trees (on
which most ensemble classifiers are based), in particular the Prism approach. The
PrismTCS standalone classifier often outperforms decision trees when applied to
noisy data, and hence is a good candidate base classifier for ensemble classifiers.
Section 2 proposed the Random Prism ensemble learner with the PrismTCS
based R-PrismTCS base classifier. It summarised results concerning classifica-
tion accuracy and gave an initial empirical estimate of Random Prism's runtime
requirements. Section 3 also highlighted a parallel version of Random Prism using
the Hadoop implementation of the MapReduce programming paradigm. Essen-
tially multiple R-PrismTCS base classifiers are executed concurrently on computing nodes in a Hadoop cluster. The only aspects of Random Prism that are
not parallelised are the inexpensive combining procedure of the individual classi-
fiers and the distribution of the original training data over the cluster. Section 4
gave a theoretical complexity analysis of Random Prism and a theoretical scala-
bility analysis of Parallel Random Prism. The parallel version of Random Prism