longer but not 100 times longer. In this particular case the increasing factor b would be 100. However, factors f and d are diminishing and thus have a shortening influence on the runtime. In general, as none of these factors comprises an increasing dependence on N or M, this can be approximated to an overall linear complexity of O(N · M). Please note that the complexity of building the composite classifier is not dependent on the training data size but on the number of classifiers. Also, building the composite classifier is a computationally relatively inexpensive operation. The bagging is also of linear complexity O(N), assuming that the bag is of size N, as in Random Prism.
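For illustration only, the short sketch below shows bagging with a bag size equal to the training data size N: building one bag takes exactly N draws with replacement, which is where the O(N) cost comes from. The function name and the toy data are assumptions for this sketch, not part of the Random Prism implementation.

```python
import random

def build_bag(training_data):
    """Bagging as used in Random Prism: draw a bootstrap sample
    (sampling with replacement) whose size equals the size N of the
    original training data -- one draw per position, hence O(N)."""
    n = len(training_data)
    return random.choices(training_data, k=n)

# Toy usage: 6 training instances as (feature vector, class label) pairs.
data = [([0, 1], "yes"), ([1, 1], "no"), ([1, 0], "yes"),
        ([0, 0], "no"), ([1, 1], "yes"), ([0, 1], "no")]
bag = build_bag(data)
assert len(bag) == len(data)  # the bag is of size N, as in Random Prism
```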
Stepping away from the complexity, the actual runtime T_total, which is needed to execute the serial version of Random Prism, can be described by:

$$T_{total} = \sum_{i=1}^{b} \left( T_{sam,i} + T_{cla,i} + T_{asm,i} \right)$$
where T_total is the total serial runtime, b is the number of base classifiers, T_sam,i is the time needed for sampling (using bagging) for classifier i; T_cla,i is the execution time for classifier i; and T_asm,i is the time needed to integrate classifier i's ruleset into the composite classifier. This description of T_total will be used as a base for describing Parallel Random Prism's runtime requirements.
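As a minimal sketch of this serial cost model, the function below is a direct transcription of the equation; the per-classifier timings are hypothetical placeholders, not measurements.

```python
def serial_runtime(t_sam, t_cla, t_asm):
    """T_total = sum over i of (T_sam,i + T_cla,i + T_asm,i)
    for b base classifiers; all three lists must have length b."""
    assert len(t_sam) == len(t_cla) == len(t_asm)
    return sum(s + c + a for s, c, a in zip(t_sam, t_cla, t_asm))

# Hypothetical timings (seconds) for b = 3 base classifiers.
t_sam = [0.2, 0.2, 0.2]   # bagging time per classifier
t_cla = [5.0, 4.5, 5.5]   # R-PrismTCS induction time per classifier
t_asm = [0.1, 0.1, 0.1]   # time to add each ruleset to the composite
print(serial_runtime(t_sam, t_cla, t_asm))  # 15.9
```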
As discussed, the basic Random Prism total runtime description T_total can be extended for describing the Parallel Random Prism runtime as shown in the equation below, where p is the number of computing nodes in the Hadoop cluster:
$$T_{total}(p) = T_{comdat} \cdot p + \frac{\sum_{i=1}^{b} T_{sam,i}}{p} + \frac{\sum_{i=1}^{b} T_{cla,i}}{p} + \frac{\sum_{i=1}^{b} T_{comres,i}}{p} + \frac{\sum_{i=1}^{b} T_{asm,i}}{r}$$
T_comdat · p is a new term that describes the time needed to communicate the training data to p mappers. p is defined as p = n · δ, where n is the number of computational nodes in the cluster and δ is the number of mappers hosted per n. T_comres is also a new term that describes the time needed to communicate the R-PrismTCS rulesets and weights to the Reducer.
T_sam,i, T_cla,i and T_asm,i are the same terms as in the equation for the serial Random Prism algorithm. However, in the parallel version T_sam,i (sampling using bagging) and T_cla,i (R-PrismTCS induction) are executed concurrently using multiple Mappers on p processors and hence their runtime can be divided by p.
T_asm,i (assembling of the composite classifier) is executed on r Reducers in the Hadoop cluster. Hence the division by r. However, in the setup used for the experiments in Sect. 5 only one Reducer has been used, hence r = 1 in the empirical results. The reason for setting r = 1 is that the computational requirement for T_asm,i is very low. The only two terms that are not parallelised are T_comdat · p and T_asm,i, and thus these present a computational bottleneck.
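The cost model above can be evaluated directly. The sketch below does so under the stated assumptions (p = n · δ mappers, r Reducers with r = 1 by default); all names and numeric values are illustrative assumptions rather than the experimental setup.

```python
def parallel_runtime(t_comdat, t_sam, t_cla, t_comres, t_asm,
                     n, delta, r=1):
    """Evaluate T_total(p) for Parallel Random Prism, where each t_*
    list holds per-classifier times of length b, p = n * delta is the
    number of mappers, and r Reducers assemble the composite."""
    p = n * delta
    return (t_comdat * p          # training data communicated to p mappers
            + sum(t_sam) / p      # bagging, shared across p mappers
            + sum(t_cla) / p      # R-PrismTCS induction across p mappers
            + sum(t_comres) / p   # rulesets and weights sent to the Reducer
            + sum(t_asm) / r)     # composite assembly on r Reducers

# Hypothetical values for b = 100 base classifiers on n = 4 nodes
# hosting delta = 2 mappers each (p = 8).
b = 100
t = parallel_runtime(t_comdat=2.0,
                     t_sam=[0.2] * b, t_cla=[5.0] * b,
                     t_comres=[0.05] * b, t_asm=[0.1] * b,
                     n=4, delta=2)
# Note that T_comdat * p grows with p while the summed terms shrink,
# reflecting the bottleneck identified above.
```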
However, for term T_comdat · p, the data transmitted to each node is a copy of the original data and it is assumed that the time needed to perform the transmission to each node is the same. Further assume that a star topology