a reduce task, where a map task (output 2) produces, in the second run but not in the
first, a value associated with the key being processed by this reduce task. In this case,
partitioning the input into groups with a fixed number of input files would cause
all groups of files to differ from one run to the next.
To solve this, we again employ content-based chunking, which is applied to every
level of the tree of combiners that forms the contraction phase. The way we perform
content-based chunking in the contraction phase differs slightly from the approach
we took in Inc-HDFS, for both efficiency and simplicity reasons. In particular, given
that the Hadoop framework splits the input to the contraction phase into multiple
files coming from different map tasks, we require chunk boundaries to be at file
boundaries. This way we leverage the existing input partitioning, which not only
simplifies the implementation, but also avoids reprocessing this input, since we can
use the hash of each input file to determine whether a marker is present: we test
whether the hash modulo a predetermined integer M equals a constant k < M.
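As an illustration, the following sketch shows how such a marker test could group
map-output files into chunks whose boundaries fall on file boundaries. The constants
M and k, the SHA-1 digest, and the class and method names are assumptions made for
the example, not Incoop's actual parameters.

import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

public class MarkerChunking {
    static final int M = 8;   // predetermined modulus (assumption)
    static final int K = 0;   // constant k < M (assumption)

    // A file closes a chunk if its content hash modulo M equals K.
    static boolean isMarker(Path file) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-1")
                                     .digest(Files.readAllBytes(file));
        // Interpret the first two digest bytes as a non-negative integer.
        int h = ((digest[0] & 0xFF) << 8) | (digest[1] & 0xFF);
        return h % M == K;
    }

    // Group the map outputs (one file each) into chunks whose boundaries
    // fall on marker files; a new map output only perturbs the chunk it
    // lands in, leaving the remaining chunks unchanged.
    static List<List<Path>> chunk(List<Path> mapOutputs) throws Exception {
        List<List<Path>> chunks = new ArrayList<>();
        List<Path> current = new ArrayList<>();
        for (Path file : mapOutputs) {
            current.add(file);
            if (isMarker(file)) {           // marker present: close this chunk
                chunks.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) chunks.add(current);   // trailing partial chunk
        return chunks;
    }
}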
Figure 4.4 also illustrates the importance of content-based chunking. In this
example, the marker, which delimits the boundaries between groups of input files,
is present only in outputs 5, 7, and 14. Therefore, inserting a new map output will
change the first group of inputs but none of the remaining ones. This figure also
illustrates how this change propagates to the output: it leads to a new combiner
invocation (labeled 1-2-3-5) and to a new final reduce invocation. For all the
remaining combiners we can reuse their memoized outputs without re-executing them.
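To make the reuse step concrete, a minimal sketch of how a combiner invocation could
be memoized follows; the in-memory memoStore map, the SHA-1 keying, and the combine()
placeholder are illustrative assumptions (Incoop persists memoized results rather
than keeping them in a local map).

import java.security.MessageDigest;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MemoizedCombiner {
    private final Map<String, byte[]> memoStore = new HashMap<>();

    // Key the memoized output by a hash of the chunk's inputs, so an
    // unchanged chunk maps to the same key across runs and can be reused.
    public byte[] run(List<byte[]> chunkInputs) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        for (byte[] input : chunkInputs) md.update(input);
        String key = java.util.Base64.getEncoder().encodeToString(md.digest());

        byte[] memoized = memoStore.get(key);
        if (memoized != null) return memoized;   // unchanged chunk: reuse, skip re-execution

        byte[] result = combine(chunkInputs);    // changed chunk: re-run the combiner
        memoStore.put(key, result);
        return result;
    }

    private byte[] combine(List<byte[]> inputs) {
        // Placeholder for the actual combine logic (e.g., summing counts).
        int total = 0;
        for (byte[] b : inputs) total += b.length;
        return Integer.toString(total).getBytes();
    }
}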
4.5 MEMOIZATION-AWARE SCHEDULER
The main job of the Hadoop scheduler is to assign map and reduce tasks to cluster
machines, taking into account machine availability, cluster topology, and the locality
of input data. However, this scheduler does not fit well with Incoop because it does
not consider the location of memoized results.
The memoization-aware scheduler addresses this shortcoming. To understand the
goals that guide its design we need to consider the map and reduce phases separately.
For the map phase, the location of memoized results is irrelevant, since, in case
the map task is able to reuse these results, it can just communicate the location of the
results to the scheduler, which then points the reduce tasks to this location. Therefore,
the memoization-aware scheduler works like the Hadoop scheduler in the map phase.
For scheduling reduce tasks, which now perform the contraction phase as well as
the final reduce invocation, the memoization-aware scheduler must try to schedule
them on the nodes where the memoized results they may use are stored. This is
important because, whenever only a part of its inputs has changed, the contraction
phase often uses a combination of newly computed and memoized results. In addition
to this design goal, the scheduler must provide some flexibility by allowing tasks to
be scheduled on nodes that do not store memoized results; otherwise, rigid placement
can lead to stragglers, that is, individual poorly performing nodes that can delay
job completion [38].
The new scheduler for reduce tasks strikes a balance between these two goals
by being aware of the location of memoized results, while at the same time
implementing a simple work-stealing algorithm to adapt to varying resource availability.
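A minimal sketch of such a policy is given below, assuming hypothetical Task,
memoNode, and per-node queue names; it only illustrates the idea of preferring the
node that stores a task's memoized results while letting idle nodes steal work, and
is not the scheduler's actual implementation.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class MemoAwareScheduler {
    static class Task { String id; String memoNode; }   // memoNode may be null

    private final Map<String, Deque<Task>> queues = new HashMap<>();

    // Enqueue each reduce task on the node that stores its memoized results,
    // falling back to a designated node for tasks with no memoized state.
    public void submit(Task t, String fallbackNode) {
        String node = (t.memoNode != null) ? t.memoNode : fallbackNode;
        queues.computeIfAbsent(node, n -> new ArrayDeque<>()).add(t);
    }

    // When a node asks for work, serve its own queue first (locality);
    // otherwise steal from the longest queue (flexibility, straggler avoidance).
    public Task nextTask(String node) {
        Deque<Task> own = queues.get(node);
        if (own != null && !own.isEmpty()) return own.poll();
        Deque<Task> victim = null;
        for (Deque<Task> q : queues.values())
            if (!q.isEmpty() && (victim == null || q.size() > victim.size()))
                victim = q;
        return (victim != null) ? victim.poll() : null;
    }
}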