longer but not 100 times longer. In this particular case the increasing factor b would be 100. However, factors f and d are diminishing and thus have a shortening influence on the runtime. In general, as none of these factors comprises an increasing dependence on N or M, this can be approximated to an overall linear complexity of O(N · M). Please note that the complexity of building the composite classifier is not dependent on the training data size but on the number of classifiers. Also, building the composite classifier is a computationally relatively inexpensive operation. The bagging is also of linear complexity O(N), assuming that the bag is of size N, as in Random Prism.
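For illustration only, the short sketch below shows bagging with a bag size equal to the training data size N: building one bag takes exactly N draws with replacement, which is where the O(N) cost comes from. The function name and the toy data are assumptions for this sketch, not part of the Random Prism implementation.

```python
import random

def build_bag(training_data):
    """Bagging as used in Random Prism: draw a bootstrap sample
    (sampling with replacement) whose size equals the size N of the
    original training data -- one draw per position, hence O(N)."""
    n = len(training_data)
    return random.choices(training_data, k=n)

# Toy usage: 6 training instances as (feature vector, class label) pairs.
data = [([0, 1], "yes"), ([1, 1], "no"), ([1, 0], "yes"),
        ([0, 0], "no"), ([1, 1], "yes"), ([0, 1], "no")]
bag = build_bag(data)
assert len(bag) == len(data)  # the bag is of size N, as in Random Prism
```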
Stepping away from the complexity, the actual runtime T_total, which is needed to execute the serial version of Random Prism, can be described by:

$$T_{total} = \sum_{i=1}^{b} \left( T_{sam,i} + T_{cla,i} + T_{asm,i} \right)$$
where T_total is the total serial runtime, b is the number of base classifiers, T_sam,i is the time needed for sampling (using bagging) for classifier i; T_cla,i is the execution time for classifier i; and T_asm,i is the time needed to integrate classifier i's ruleset into the composite classifier. This description of T_total will be used as a base for describing Parallel Random Prism's runtime requirements.
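As a minimal sketch of this serial cost model, the function below is a direct transcription of the equation; the per-classifier timings are hypothetical placeholders, not measurements.

```python
def serial_runtime(t_sam, t_cla, t_asm):
    """T_total = sum over i of (T_sam,i + T_cla,i + T_asm,i)
    for b base classifiers; all three lists must have length b."""
    assert len(t_sam) == len(t_cla) == len(t_asm)
    return sum(s + c + a for s, c, a in zip(t_sam, t_cla, t_asm))

# Hypothetical timings (seconds) for b = 3 base classifiers.
t_sam = [0.2, 0.2, 0.2]   # bagging time per classifier
t_cla = [5.0, 4.5, 5.5]   # R-PrismTCS induction time per classifier
t_asm = [0.1, 0.1, 0.1]   # time to add each ruleset to the composite
print(serial_runtime(t_sam, t_cla, t_asm))  # 15.9
```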
As discussed, the basic Random Prism total runtime description T_total can be extended for describing the Parallel Random Prism runtime as shown in the equation below, where p is the number of computing nodes in the Hadoop cluster:
$$T_{total}(p) = T_{comdat} \cdot p + \frac{\sum_{i=1}^{b} T_{sam,i}}{p} + \frac{\sum_{i=1}^{b} T_{cla,i}}{p} + \frac{\sum_{i=1}^{b} T_{comres,i}}{p} + \frac{\sum_{i=1}^{b} T_{asm,i}}{r}$$
T_comdat · p is a new term that describes the time needed to communicate the training data to p mappers. p is defined as p = n · δ, where n is the number of computational nodes in the cluster and δ is the number of mappers hosted per n. T_comres is also a new term that describes the time needed to communicate the R-PrismTCS rulesets and weights to the Reducer.
T_sam,i, T_cla,i and T_asm,i are the same terms as in the equation for the serial Random Prism algorithm. However, in the parallel version T_sam,i (sampling using bagging) and T_cla,i (R-PrismTCS induction) are executed concurrently using multiple Mappers on p processors and hence their runtime can be divided by p.
T_asm,i (assembling of the composite classifier) is executed on r Reducers in the Hadoop cluster. Hence the division by r. However, in the setup used for the experiments in Sect. 5 only one Reducer has been used, hence r = 1 in the empirical results. The reason for setting r = 1 is that the computational requirement for T_asm,i is very low. The only two terms that are not parallelised are T_comdat · p and T_asm,i, and thus these present a computational bottleneck.
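The cost model above can be evaluated directly. The sketch below does so under the stated assumptions (p = n · δ mappers, r Reducers with r = 1 by default); all names and numeric values are illustrative assumptions rather than the experimental setup.

```python
def parallel_runtime(t_comdat, t_sam, t_cla, t_comres, t_asm,
                     n, delta, r=1):
    """Evaluate T_total(p) for Parallel Random Prism, where each t_*
    list holds per-classifier times of length b, p = n * delta is the
    number of mappers, and r Reducers assemble the composite."""
    p = n * delta
    return (t_comdat * p          # training data communicated to p mappers
            + sum(t_sam) / p      # bagging, shared across p mappers
            + sum(t_cla) / p      # R-PrismTCS induction across p mappers
            + sum(t_comres) / p   # rulesets and weights sent to the Reducer
            + sum(t_asm) / r)     # composite assembly on r Reducers

# Hypothetical values for b = 100 base classifiers on n = 4 nodes
# hosting delta = 2 mappers each (p = 8).
b = 100
t = parallel_runtime(t_comdat=2.0,
                     t_sam=[0.2] * b, t_cla=[5.0] * b,
                     t_comres=[0.05] * b, t_asm=[0.1] * b,
                     n=4, delta=2)
# Note that T_comdat * p grows with p while the summed terms shrink,
# reflecting the bottleneck identified above.
```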
However, for term T_comdat · p, the data transmitted to each node is a copy of the original data and it is assumed that the time needed to perform the transmission to each node is the same. Further assume that a star topology