the required memory would be 100 times larger compared with the memory
requirements of the standalone PrismTCS classifier. The CPU requirements of
Random Prism are high, but not 100 times higher due to the random feature
subset selection. The parallelisation of the algorithm allows harvesting of the
memory and CPU time of multiple workstations for inducing the Random Prism
ensemble classifier.
In data parallelism, smaller portions of the data are distributed to different
computing nodes, on which data mining tasks are executed concurrently [23].
Ensemble learning lends itself to data parallelism, as it is composed of many
different data mining tasks, the inductions of the individual base classifiers,
which can be executed independently and thus concurrently. Hence, a data
parallel approach has been chosen for Random Prism. However, there are some
limiting factors concerning scalability, which will be analysed in Sect. 4.
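This data-parallel ensemble induction can be sketched as follows. The sketch is illustrative only: a trivial majority-class learner stands in for R-PrismTCS induction, and a thread pool stands in for the cluster's worker nodes; all function names are assumptions, not the paper's implementation.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def bag(data, rng):
    """Draw a bootstrap sample (with replacement) of the same size as the data."""
    return [rng.choice(data) for _ in data]

def induce_base_classifier(sample):
    """Stand-in for inducing one base classifier on its bagged sample:
    here simply the sample's majority class (a real system would induce
    an R-PrismTCS rule set instead)."""
    labels = [label for _, label in sample]
    return max(set(labels), key=labels.count)

def induce_ensemble(data, n_classifiers=10, seed=0):
    """Bag the data once per base classifier, then induce all base
    classifiers concurrently; each induction is independent of the others."""
    rng = random.Random(seed)
    samples = [bag(data, rng) for _ in range(n_classifiers)]
    with ThreadPoolExecutor() as pool:  # on a cluster: one task per node
        return list(pool.map(induce_base_classifier, samples))
```

Because the inductions share nothing, the same structure carries over unchanged when the thread pool is replaced by distributed worker nodes.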
Section 3.1 outlines the MapReduce paradigm, which has been adopted for
the parallelisation of Random Prism, and Sect. 3.2 describes the architecture of
Parallel Random Prism.
3.1 Parallelisation Using the MapReduce Paradigm
MapReduce [10] is a programming paradigm for parallel processing introduced
by Google. It provides a simple way of developing data parallel data mining
techniques and thus lends itself to the development of parallel ensemble learn-
ers [17]. In addition, MapReduce computer cluster implementations, such as
the open source Hadoop implementation [1], provide fault tolerance and auto-
matic workload balancing. Hadoop's MapReduce implementation is based on
the Hadoop Distributed File System (HDFS), which distributes the data over
the computer cluster and stores it redundantly, in order both to speed up data
access and to establish fault tolerance.
Figure 2 illustrates a Hadoop computer cluster. MapReduce partitions an
application into smaller parts, implemented as Mapper components. Mappers
can be processed by any computing node within a MapReduce cluster. The
aggregation of the results produced by the Mappers is implemented in one or
more Reducer components, which again can be processed by any computing
node within the MapReduce cluster.
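The division of labour between Mappers and Reducers can be illustrated with a toy in-memory MapReduce; the function names are assumptions (Hadoop's actual Java API differs), and the classic word count serves as the mapper/reducer pair:

```python
from collections import defaultdict
from itertools import chain

def map_reduce(partitions, mapper, reducer):
    """Toy MapReduce: apply the mapper to each data partition (these calls
    are independent and could run on different cluster nodes), group the
    emitted (key, value) pairs by key, then reduce each group."""
    mapped = chain.from_iterable(mapper(p) for p in partitions)
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word count as the mapper/reducer pair.
def wc_mapper(partition):
    for line in partition:
        for word in line.split():
            yield word, 1

def wc_reducer(word, counts):
    return sum(counts)
```

In a real cluster the grouping step (the "shuffle") is handled by the framework, and both the mapper and reducer invocations are scheduled onto whichever nodes are available.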
MapReduce's significance in the area of data mining is evident from its
adoption for many data mining tasks and projects, in science as well as in busi-
ness. For example, by 2008 Google was using MapReduce in over 900 projects
[10], such as the clustering of images for identifying duplicates [16]. In 2009 the
authors of [17] used MapReduce to induce and assemble numerous ensemble
trees in parallel.
Random Prism can be broken down into multiple R-PrismTCS classifiers
induced on bagged samples of the training data. Loosely speaking, Random
Prism can be parallelised with Hadoop by implementing the R-PrismTCS clas-
sifiers as Mappers, which can then be executed concurrently in a MapReduce
cluster. The Parallel Random Prism architecture is described in more detail in
Sect. 3.2.
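This mapping of base classifiers onto Mappers can be sketched as a mapper/reducer pair. The sketch is hedged: a majority-class stand-in replaces the actual R-PrismTCS rule induction, and the names are illustrative. Each Mapper induces one classifier on its bagged sample and emits it under a shared key, so a single Reducer collects the complete ensemble:

```python
def rprismtcs_mapper(bagged_sample):
    """One Mapper: induce a single base classifier on its bagged sample.
    The stand-in here records the sample's majority class; Parallel
    Random Prism would induce an R-PrismTCS rule set at this point.
    Emitting under one shared key routes all classifiers to one Reducer."""
    labels = [label for _, label in bagged_sample]
    yield "ensemble", max(set(labels), key=labels.count)

def ensemble_reducer(key, classifiers):
    """One Reducer: aggregate the independently induced base classifiers
    into the final Random Prism ensemble."""
    return list(classifiers)
```

Since the Mappers never communicate with each other, the framework is free to run one per node, which is exactly what makes the data-parallel decomposition attractive.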