a reduce task, where a map task (output 2) produces, in the second run but not in the
first, a value associated with the key being processed by this reduce task. In this case,
partitioning the input into groups with a fixed number of input files would cause
all groups of files to differ from one run to the next.
To solve this, we again employ content-based chunking, which is applied to every
level of the tree of combiners that forms the contraction phase. The way we perform
content-based chunking in the contraction phase differs slightly from the approach
we took in Inc-HDFS, for both efficiency and simplicity reasons. In particular, given
that the Hadoop framework splits the input to the contraction phase into multiple
files coming from different map tasks, we require chunk boundaries to be at file
boundaries. This way we leverage the existing input partitioning, which not only
simplifies the implementation, but also avoids reprocessing this input, since we can
use the hash of each input file to determine whether a marker is present: we test
whether the hash modulo a predetermined integer M equals a constant k < M.
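As an illustration, the following sketch shows how such a marker test could group
map-output files into chunks whose boundaries fall on file boundaries. The constants
M and k, the SHA-1 digest, and the class and method names are assumptions made for
the example, not Incoop's actual parameters.

import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

public class MarkerChunking {
    static final int M = 8;   // predetermined modulus (assumption)
    static final int K = 0;   // constant k < M (assumption)

    // A file closes a chunk if its content hash modulo M equals K.
    static boolean isMarker(Path file) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-1")
                                     .digest(Files.readAllBytes(file));
        // Interpret the first two digest bytes as a non-negative integer.
        int h = ((digest[0] & 0xFF) << 8) | (digest[1] & 0xFF);
        return h % M == K;
    }

    // Group the map outputs (one file each) into chunks whose boundaries
    // fall on marker files; a new map output only perturbs the chunk it
    // lands in, leaving the remaining chunks unchanged.
    static List<List<Path>> chunk(List<Path> mapOutputs) throws Exception {
        List<List<Path>> chunks = new ArrayList<>();
        List<Path> current = new ArrayList<>();
        for (Path file : mapOutputs) {
            current.add(file);
            if (isMarker(file)) {           // marker present: close this chunk
                chunks.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) chunks.add(current);   // trailing partial chunk
        return chunks;
    }
}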
Figure 4.4 also illustrates the importance of content-based chunking. In this
example, the marker, which delimits the boundaries between groups of input files,
is present only in outputs 5, 7, and 14. Therefore, inserting a new map output will
change the first group of inputs but none of the remaining ones. This figure also
illustrates how this change propagates to the output: it leads to a new combiner
invocation (labeled 1-2-3-5) and to a new final reduce invocation. For all the
remaining combiners we can reuse their memoized outputs without re-executing them.
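To make the reuse step concrete, a minimal sketch of how a combiner invocation could
be memoized follows; the in-memory memoStore map, the SHA-1 keying, and the combine()
placeholder are illustrative assumptions (Incoop persists memoized results rather
than keeping them in a local map).

import java.security.MessageDigest;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MemoizedCombiner {
    private final Map<String, byte[]> memoStore = new HashMap<>();

    // Key the memoized output by a hash of the chunk's inputs, so an
    // unchanged chunk maps to the same key across runs and can be reused.
    public byte[] run(List<byte[]> chunkInputs) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        for (byte[] input : chunkInputs) md.update(input);
        String key = java.util.Base64.getEncoder().encodeToString(md.digest());

        byte[] memoized = memoStore.get(key);
        if (memoized != null) return memoized;   // unchanged chunk: reuse, skip re-execution

        byte[] result = combine(chunkInputs);    // changed chunk: re-run the combiner
        memoStore.put(key, result);
        return result;
    }

    private byte[] combine(List<byte[]> inputs) {
        // Placeholder for the actual combine logic (e.g., summing counts).
        int total = 0;
        for (byte[] b : inputs) total += b.length;
        return Integer.toString(total).getBytes();
    }
}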
4.5 MEMOIZATION-AWARE SCHEDULER
The main job of the Hadoop scheduler is to assign map and reduce tasks to cluster
machines, taking into account machine availability, cluster topology, and the locality
of input data. However, this scheduler does not fit well with Incoop because it does
not consider the location of memoized results.
The memoization-aware scheduler addresses this shortcoming. To understand the
goals that guide its design we need to consider the map and reduce phases separately.
For the map phase, the location of memoized results is irrelevant, since, in case
the map task is able to reuse these results, it can just communicate the location of the
results to the scheduler, which then points the reduce tasks to this location. Therefore,
the memoization-aware scheduler works like the Hadoop scheduler in the map phase.
For scheduling reduce tasks, which now perform the contraction phase as well as
the final reduce invocation, the memoization-aware scheduler must try to schedule
them on the nodes where the memoized results they may use are stored. This is
important because, whenever only a part of its inputs has changed, the contraction
phase often uses a combination of newly computed and memoized results. In addition
to this design goal, the scheduler must provide some flexibility by allowing tasks to
be scheduled on nodes that do not store memoized results; otherwise, rigid placement
can lead to stragglers, that is, individual poorly performing nodes that can delay
job completion [38].
The new scheduler for reduce tasks strikes a balance between these two goals
by being aware of the location of memoized results, while at the same time
implementing a simple work-stealing algorithm to adapt to varying resource availability.
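A minimal sketch of such a policy is given below, assuming hypothetical Task,
memoNode, and per-node queue names; it only illustrates the idea of preferring the
node that stores a task's memoized results while letting idle nodes steal work, and
is not the scheduler's actual implementation.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class MemoAwareScheduler {
    static class Task { String id; String memoNode; }   // memoNode may be null

    private final Map<String, Deque<Task>> queues = new HashMap<>();

    // Enqueue each reduce task on the node that stores its memoized results,
    // falling back to a designated node for tasks with no memoized state.
    public void submit(Task t, String fallbackNode) {
        String node = (t.memoNode != null) ? t.memoNode : fallbackNode;
        queues.computeIfAbsent(node, n -> new ArrayDeque<>()).add(t);
    }

    // When a node asks for work, serve its own queue first (locality);
    // otherwise steal from the longest queue (flexibility, straggler avoidance).
    public Task nextTask(String node) {
        Deque<Task> own = queues.get(node);
        if (own != null && !own.isEmpty()) return own.poll();
        Deque<Task> victim = null;
        for (Deque<Task> q : queues.values())
            if (!q.isEmpty() && (victim == null || q.size() > victim.size()))
                victim = q;
        return (victim != null) ? victim.poll() : null;
    }
}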