Advanced Decision Trees - Data Mining with Decision Trees: Theory and Applications

Database Reference

In-Depth Information

all instances are referenced to the root node. The attribute lists for the

numeric features are then sorted independently.

11.8.2

Parallel Induction of Tree

PLANET is a recent classification and regression tree algorithm from

Google based on MapReduce. The basic idea of the parallel algorithm is to

iteratively build the regression tree in a breadth first manner, one complete

level at a time in a distributed environment, until the data partitions are

small enough to fit in memory and a “leaf” sub-tree can be constructed

locally on a single machine

Panda et al . (2009) proposed a parallel implemention of tree induction

with MapReduce framework. MapReduce is one of the most popular parallel

programming frameworks for data mining. In this framework, programmers

need to specify a map function which processes key/value pairs to emit a

set of intermediate key/value pairs and a reduce function that aggregates

all intermediate values associated with the same intermediate key. The

MapReduce framework was pioneered by Google and then popularized

by open-source Apache Hadoop project. While there are other parallel

programming frameworks (such as CUDA and MPI), MapReduce has

become the industry standard and implemented on cloud computing

services such as Amazon EC2 and various companies such as Cloudera

provide services to ease Hadoop deployment.

The greedy nature of tree induction algorithm does not scale well to

a large dataset. For example to find the best split in the root node, it

is required to go over all instances in the dataset. And when the entire

dataset does not fit in main memory, this full scan becomes a bottleneck

because of the cost of scanning data from secondary storage. The basic

idea of PLANET is to iteratively grow the decision tree, one layer at a

time, until the data partitions are small enough to fit the main memory

and the remaining sub-tree can be grown locally on a single machine. For

the higher levels the key idea of PLANET is that the splitting criterion

at a certain node does not need the entire dataset but a compact data

structure of sucient statistics which in most cases can be fit-in-memory.

These statistics are calculated on the mappers.

Most recently, Yin et al . (2012) presented OpenPlanet, an open

implementation of a scalable regression tree machine learning algorithm

on Hadoop.

Search WWH ::

Custom Search

Home