The scheduler maintains a separate queue of pending reduce tasks for each node
in the cluster (instead of a single queue for all nodes). Each queue is populated with
the tasks that should run on that node to exploit the location of memoized results.
Whenever a node requests more work, the scheduler dequeues the first task from the
corresponding queue and assigns it to the node for execution. If the queue for the
requesting node is empty, the scheduler attempts to steal work from the other task
queues by choosing a pending task from the queue with the maximum length. Our
experimental evaluation (Section 4.6.6) shows the effectiveness of the new scheduler.
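The queue-per-node design with work stealing can be sketched as follows. This is an illustrative sketch, not the Incoop source; the class and method names are our own.

```python
from collections import deque

class MemoizationAwareScheduler:
    """Sketch of a memoization-aware scheduler: one FIFO queue of
    pending reduce tasks per node, with work stealing from the longest
    queue when a node's own queue is empty (names are illustrative)."""

    def __init__(self, nodes):
        # One pending-task queue per node in the cluster.
        self.queues = {node: deque() for node in nodes}

    def enqueue(self, node, task):
        # Place the task on the queue of the node that holds its
        # memoized results, to exploit locality.
        self.queues[node].append(task)

    def request_work(self, node):
        # Serve the requesting node's own queue first.
        if self.queues[node]:
            return self.queues[node].popleft()
        # Otherwise steal a pending task from the longest queue.
        victim = max(self.queues, key=lambda n: len(self.queues[n]))
        if self.queues[victim]:
            return self.queues[victim].popleft()
        return None  # no pending work anywhere
```

A stolen task loses locality for its memoized results, but stealing from the longest queue keeps nodes busy while disturbing the locality-driven placement as little as possible.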
4.6 IMPLEMENTATION AND EVALUATION
This section describes the implementation and experimental evaluation of Incoop.
4.6.1 Implementation
We built a prototype of Incoop based on Hadoop-0.20.2. The implementation of Inc-
HDFS extends HDFS with stable input partitioning, and incremental MapReduce
extends Hadoop with support for memoization, the contraction phase, and the
memoization-aware scheduler.
The Inc-HDFS file system keeps exactly the same interface and semantics for all
existing HDFS calls, and implements the parallel scanning scheme to find markers
that we described in Section 4.3. For our experiments, we set the minimum chunk
size (i.e., the number of bytes skipped after finding a marker) to 40 MB, unless oth-
erwise noted. In the HDFS baseline, we use a chunk size of 64 MB.
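The interaction of the minimum chunk size with marker-based partitioning can be sketched as follows; the function name and byte-string interface are our own, and this is a simplified stand-in for the parallel scanning scheme of Section 4.3, not the Inc-HDFS code.

```python
def stable_chunks(data: bytes, marker: bytes, min_chunk: int):
    """Content-based partitioning sketch: after each chunk boundary,
    skip min_chunk bytes, then cut at the next marker occurrence.
    Boundaries depend only on content, so an edit perturbs only the
    chunk(s) it touches (illustrative, not the Inc-HDFS scheme)."""
    boundaries = []
    pos = 0
    while True:
        # Enforce the minimum chunk size before searching for a marker.
        nxt = data.find(marker, pos + min_chunk)
        if nxt == -1:
            break
        cut = nxt + len(marker)
        boundaries.append(cut)
        pos = cut
    # Emit the chunks delimited by the boundaries found.
    chunks, start = [], 0
    for b in boundaries:
        chunks.append(data[start:b])
        start = b
    chunks.append(data[start:])
    return chunks
```

Because each boundary is anchored to a marker in the content rather than to a fixed byte offset, inserting data shifts at most the chunks around the edit, which is what makes memoized results for unchanged chunks reusable.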
The memoization server is built as a wrapper around memcached v1.4.5, an in-memory
key/value store. The memcached server is colocated with the name node, which serves
as Hadoop's directory server. All the memoized results are stored in Inc-HDFS with
a replication factor of one, which implies that if a data node crashes, these results
need to be recomputed. We implemented a simple garbage collection scheme that
discards old memoized results: it retains only the results from the most recent run
of a given MapReduce job and discards all results from previous runs.
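The garbage-collection policy can be made concrete with a small in-memory stand-in; memcached and Inc-HDFS are elided here, and all names are ours rather than Incoop's.

```python
class MemoizationStore:
    """Sketch of the memoization server's garbage collection: results
    are tagged with the run that produced them, and storing results
    for a new run of a job discards those of the previous run
    (illustrative in-memory stand-in, not the Incoop implementation)."""

    def __init__(self):
        self.results = {}     # (job, key) -> (run_id, value)
        self.latest_run = {}  # job -> most recent run_id seen

    def put(self, job, run_id, key, value):
        prev = self.latest_run.get(job)
        if prev is not None and run_id != prev:
            # New run of this job: discard the previous run's results.
            self.results = {k: v for k, v in self.results.items()
                            if not (k[0] == job and v[0] == prev)}
        self.latest_run[job] = run_id
        self.results[(job, key)] = (run_id, value)

    def get(self, job, key):
        entry = self.results.get((job, key))
        return entry[1] if entry else None
```

Keeping only the latest run bounds the store's size at the cost of losing reuse across non-consecutive runs, which matches the simple retention policy described above.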
Finally, the contraction phase is implemented by aggregating all keys that are pro-
cessed by each node, instead of building one contraction tree for each key processed
by that node, since this is closer to the original Hadoop implementation.
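One way to picture a contraction tree is as a fixed fan-out tree of combiner applications: if each internal result is memoized, a small change to the reduce input recomputes only the path from the affected leaf to the root. The sketch below is our own illustration of that idea, not the per-node aggregation actually implemented in Incoop.

```python
def contraction_tree(values, combine, fanout=2):
    """Illustrative contraction phase: combine reduce inputs in a tree
    of fixed fan-out, returning the root result and every level of the
    tree. `combine` must be an associative combiner over a list of
    values (names and interface are ours, not Incoop's)."""
    level = list(values)
    levels = [level]
    while len(level) > 1:
        # Combine each group of `fanout` adjacent values into one.
        level = [combine(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
        levels.append(level)
    return level[0], levels
```

Memoizing the intermediate levels is what turns a monolithic reduce into an incremental one: unchanged subtrees are reused, and only O(log n) combiner applications lie on the path from a changed leaf to the root.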
4.6.2 Applications
We evaluated Incoop using a set of MapReduce applications from the open-source
Apache Mahout project summarized in Table 4.1. These applications cover a wide
range of domains such as machine learning, natural language processing, pattern rec-
ognition, and document analysis. Furthermore, this set includes both data-intensive
(WordCount, CoMatrix, BiCount) and CPU-intensive (KNN and K-Means)
computations, corresponding to different ratios of I/O to CPU load. We did not have
to modify any of these applications to work with Incoop.
The three data-intensive applications use documents written in a natural lan-
guage as input. In our benchmarks, we used as input the contents of Wikipedia