Database Reference
In-Depth Information
all instances are referenced to the root node. The attribute lists for the
numeric features are then sorted independently.
11.8.2
Parallel Induction of Tree
PLANET is a recent classification and regression tree algorithm from
Google based on MapReduce. The basic idea of the parallel algorithm is to
iteratively build the regression tree in a breadth first manner, one complete
level at a time in a distributed environment, until the data partitions are
small enough to fit in memory and a “leaf” sub-tree can be constructed
locally on a single machine
Panda et al . (2009) proposed a parallel implemention of tree induction
with MapReduce framework. MapReduce is one of the most popular parallel
programming frameworks for data mining. In this framework, programmers
need to specify a map function which processes key/value pairs to emit a
set of intermediate key/value pairs and a reduce function that aggregates
all intermediate values associated with the same intermediate key. The
MapReduce framework was pioneered by Google and then popularized
by open-source Apache Hadoop project. While there are other parallel
programming frameworks (such as CUDA and MPI), MapReduce has
become the industry standard and implemented on cloud computing
services such as Amazon EC2 and various companies such as Cloudera
provide services to ease Hadoop deployment.
The greedy nature of tree induction algorithm does not scale well to
a large dataset. For example to find the best split in the root node, it
is required to go over all instances in the dataset. And when the entire
dataset does not fit in main memory, this full scan becomes a bottleneck
because of the cost of scanning data from secondary storage. The basic
idea of PLANET is to iteratively grow the decision tree, one layer at a
time, until the data partitions are small enough to fit the main memory
and the remaining sub-tree can be grown locally on a single machine. For
the higher levels the key idea of PLANET is that the splitting criterion
at a certain node does not need the entire dataset but a compact data
structure of sucient statistics which in most cases can be fit-in-memory.
These statistics are calculated on the mappers.
Most recently, Yin et al . (2012) presented OpenPlanet, an open
implementation of a scalable regression tree machine learning algorithm
on Hadoop.
Search WWH ::




Custom Search