The scheduler maintains a separate queue of pending reduce tasks for each node
in the cluster (instead of a single queue for all nodes). Each queue is populated with
the tasks that should run on that node to exploit the location of memoized results.
Whenever a node requests more work, the scheduler dequeues the first task from the
corresponding queue and assigns it to the node for execution. If the queue for the
requesting node is empty, the scheduler attempts to steal work from the other task
queues by choosing a pending task from the queue with the maximum length. Our
experimental evaluation (Section 4.6.6) shows the effectiveness of the new scheduler.
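The queue-per-node design with work stealing can be sketched as follows. This is an illustrative sketch, not the Incoop source; the class and method names are our own.

```python
from collections import deque

class MemoizationAwareScheduler:
    """Sketch of a memoization-aware scheduler: one FIFO queue of
    pending reduce tasks per node, with work stealing from the longest
    queue when a node's own queue is empty (names are illustrative)."""

    def __init__(self, nodes):
        # One pending-task queue per node in the cluster.
        self.queues = {node: deque() for node in nodes}

    def enqueue(self, node, task):
        # Place the task on the queue of the node that holds its
        # memoized results, to exploit locality.
        self.queues[node].append(task)

    def request_work(self, node):
        # Serve the requesting node's own queue first.
        if self.queues[node]:
            return self.queues[node].popleft()
        # Otherwise steal a pending task from the longest queue.
        victim = max(self.queues, key=lambda n: len(self.queues[n]))
        if self.queues[victim]:
            return self.queues[victim].popleft()
        return None  # no pending work anywhere
```

A stolen task loses locality for its memoized results, but stealing from the longest queue keeps nodes busy while disturbing the locality-driven placement as little as possible.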
4.6 IMPLEMENTATION AND EVALUATION
This section describes the implementation and experimental evaluation of Incoop.
4.6.1 Implementation
We built a prototype of Incoop based on Hadoop-0.20.2. The implementation of Inc-
HDFS extends HDFS with stable input partitioning, and incremental MapReduce
extends Hadoop with support for memoization, the contraction phase, and the
memoization-aware scheduler.
The Inc-HDFS file system keeps exactly the same interface and semantics for all
existing HDFS calls, and implements the parallel scanning scheme to find markers
that we described in Section 4.3. For our experiments, we set the minimum chunk
size (i.e., the number of bytes skipped after finding a marker) to 40 MB, unless oth-
erwise noted. In the HDFS baseline, we use a chunk size of 64 MB.
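The interaction of the minimum chunk size with marker-based partitioning can be sketched as follows; the function name and byte-string interface are our own, and this is a simplified stand-in for the parallel scanning scheme of Section 4.3, not the Inc-HDFS code.

```python
def stable_chunks(data: bytes, marker: bytes, min_chunk: int):
    """Content-based partitioning sketch: after each chunk boundary,
    skip min_chunk bytes, then cut at the next marker occurrence.
    Boundaries depend only on content, so an edit perturbs only the
    chunk(s) it touches (illustrative, not the Inc-HDFS scheme)."""
    boundaries = []
    pos = 0
    while True:
        # Enforce the minimum chunk size before searching for a marker.
        nxt = data.find(marker, pos + min_chunk)
        if nxt == -1:
            break
        cut = nxt + len(marker)
        boundaries.append(cut)
        pos = cut
    # Emit the chunks delimited by the boundaries found.
    chunks, start = [], 0
    for b in boundaries:
        chunks.append(data[start:b])
        start = b
    chunks.append(data[start:])
    return chunks
```

Because each boundary is anchored to a marker in the content rather than to a fixed byte offset, inserting data shifts at most the chunks around the edit, which is what makes memoized results for unchanged chunks reusable.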
The memoization server is built as a wrapper around memcached v1.4.5, an in-memory
key/value store. The memcached server is colocated with the name node, which serves
as Hadoop's directory server. All the memoized results are stored in Inc-HDFS with
a replication factor of one, which implies that if a data node crashes, these results
need to be recomputed. We implemented a simple garbage collection scheme that
discards old memoized results: it retains only the results from the most recent run
of a given MapReduce job and discards all results from previous runs.
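The garbage-collection policy can be made concrete with a small in-memory stand-in; memcached and Inc-HDFS are elided here, and all names are ours rather than Incoop's.

```python
class MemoizationStore:
    """Sketch of the memoization server's garbage collection: results
    are tagged with the run that produced them, and storing results
    for a new run of a job discards those of the previous run
    (illustrative in-memory stand-in, not the Incoop implementation)."""

    def __init__(self):
        self.results = {}     # (job, key) -> (run_id, value)
        self.latest_run = {}  # job -> most recent run_id seen

    def put(self, job, run_id, key, value):
        prev = self.latest_run.get(job)
        if prev is not None and run_id != prev:
            # New run of this job: discard the previous run's results.
            self.results = {k: v for k, v in self.results.items()
                            if not (k[0] == job and v[0] == prev)}
        self.latest_run[job] = run_id
        self.results[(job, key)] = (run_id, value)

    def get(self, job, key):
        entry = self.results.get((job, key))
        return entry[1] if entry else None
```

Keeping only the latest run bounds the store's size at the cost of losing reuse across non-consecutive runs, which matches the simple retention policy described above.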
Finally, the contraction phase is implemented by aggregating all keys that are pro-
cessed by each node, instead of building one contraction tree for each key processed
by that node, since this is closer to the original Hadoop implementation.
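One way to picture a contraction tree is as a fixed fan-out tree of combiner applications: if each internal result is memoized, a small change to the reduce input recomputes only the path from the affected leaf to the root. The sketch below is our own illustration of that idea, not the per-node aggregation actually implemented in Incoop.

```python
def contraction_tree(values, combine, fanout=2):
    """Illustrative contraction phase: combine reduce inputs in a tree
    of fixed fan-out, returning the root result and every level of the
    tree. `combine` must be an associative combiner over a list of
    values (names and interface are ours, not Incoop's)."""
    level = list(values)
    levels = [level]
    while len(level) > 1:
        # Combine each group of `fanout` adjacent values into one.
        level = [combine(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
        levels.append(level)
    return level[0], levels
```

Memoizing the intermediate levels is what turns a monolithic reduce into an incremental one: unchanged subtrees are reused, and only O(log n) combiner applications lie on the path from a changed leaf to the root.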
4.6.2 Applications
We evaluated Incoop using a set of MapReduce applications from the open-source
Apache Mahout project summarized in Table 4.1. These applications cover a wide
range of domains such as machine learning, natural language processing, pattern rec-
ognition, and document analysis. Furthermore, this set includes both data-intensive
(WordCount, CoMatrix, BiCount) and CPU-intensive (KNN and K-Means)
computations, corresponding to different ratios of I/O to CPU load. We did not have
to modify any of these applications to work with Incoop.
The three data-intensive applications use documents written in a natural lan-
guage as input. In our benchmarks, we used as input the contents of Wikipedia