reused. To ensure this, we introduce a modification to the scheduler used by Hadoop to incorporate a notion of affinity. The new scheduler takes into account affinities
between machines and tasks by keeping a record of which nodes have executed
which tasks. This allows for scheduling tasks in a way that decreases the movement
of memoized intermediate results, but at the cost of a potential degradation of job
performance due to stragglers [38]. This is because a strict affinity of tasks results
in deterministic scheduling, which prevents a lightly loaded node from performing
work when the predetermined node is heavily loaded. Our scheduler therefore needs
to strike a balance between work stealing and affinity of memoized results. Section
4.5 describes our modified scheduler.
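To make this trade-off concrete, the sketch below shows one way an affinity-aware placement decision could be structured: a task goes to the node that holds its memoized results unless that node is noticeably busier than the least-loaded node, in which case the work is stolen. The class, method names, and the LOAD_SLACK threshold are hypothetical illustrations; this is not the scheduler of Section 4.5 and does not use Hadoop's actual scheduler API.

```java
// Illustrative affinity-aware placement policy (hypothetical names, not Hadoop's API).
import java.util.HashMap;
import java.util.Map;

class AffinityScheduler {
    // Records which node last executed (and therefore holds memoized results for) each task.
    private final Map<String, String> taskAffinity = new HashMap<>();
    // Number of tasks currently assigned to each node, used as a simple load estimate.
    private final Map<String, Integer> nodeLoad = new HashMap<>();
    // Extra load tolerated on the affinity node before falling back to work stealing.
    private static final int LOAD_SLACK = 2;

    // leastLoadedNode is the node a plain work-stealing policy would pick.
    String placeTask(String taskId, String leastLoadedNode) {
        String preferred = taskAffinity.get(taskId);
        if (preferred != null) {
            int preferredLoad = nodeLoad.getOrDefault(preferred, 0);
            int bestLoad = nodeLoad.getOrDefault(leastLoadedNode, 0);
            // Honor affinity only if the preferred node is not much busier than
            // the least-loaded node; otherwise steal the work.
            if (preferredLoad - bestLoad <= LOAD_SLACK) {
                return schedule(taskId, preferred);
            }
        }
        return schedule(taskId, leastLoadedNode);
    }

    private String schedule(String taskId, String node) {
        taskAffinity.put(taskId, node);   // future runs prefer this node
        nodeLoad.merge(node, 1, Integer::sum);
        return node;
    }
}
```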
4.3 INCREMENTAL HDFS
In this section, we present Incremental HDFS (Inc-HDFS), a distributed file system
that enables stable incremental computations in Incoop, while keeping the interface
provided by HDFS. Inc-HDFS builds on HDFS, but modifies the way that files are
partitioned into chunks to use content-based chunking, a technique that was intro-
duced in LBFS [31] for data deduplication. At a high level, content-based chunking
defines chunk boundaries based on finding certain patterns in the input, instead of
using fixed-size chunks. As such, insertions and deletions cause small changes to the
set of chunks. In the context of MapReduce, this ensures that the input to map tasks
remains mostly unchanged, which translates into a stable recomputation. Figure 4.2
illustrates the differences in the strategies for determining chunk boundaries in
HDFS and Inc-HDFS. To perform content-based chunking, we scan the entire file,
examining the contents of a fixed-width window whose initial position is incre-
mented one byte at a time. For each window, we compute its Rabin fingerprint, and if
the fingerprint matches a certain pattern (called a marker), we place a chunk bound-
ary at that position. In addition, this approach can be extended to avoid creating
chunks that are too small or too large, which could affect the overheads and load
balancing properties of MapReduce. (Note that the system designer can only tune
the likelihood of finding a marker; the actual spacing depends on the input.) This
is achieved by setting minimum and maximum chunk sizes: after we find a marker
m_i at position p_i, we skip a fixed offset O and continue to scan the input after position
p_i + O. In addition, we bound the chunk length by setting a marker after M content
bytes even if no marker is found. This can affect stability in rare cases, for example
when skipping the offset leads to two consecutive runs skipping disjoint sets of mark-
ers, but we found this to be a very limited problem in practice.
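The chunking loop described above can be sketched as follows. The sketch uses a simple hash as a stand-in for the Rabin fingerprint, and the window size, marker pattern, and the values chosen for the offset O (MIN_CHUNK) and the bound M (MAX_CHUNK) are illustrative, not the parameters used by Inc-HDFS.

```java
// Sketch of content-based chunking with minimum and maximum chunk sizes.
import java.util.ArrayList;
import java.util.List;

class ContentChunker {
    static final int WINDOW = 48;                    // fixed-width fingerprint window (bytes)
    static final long MARKER_MASK = (1L << 22) - 1;  // marker: low 22 bits of the fingerprint are zero
    static final int MIN_CHUNK = 2 * 1024 * 1024;    // offset O skipped after each boundary
    static final int MAX_CHUNK = 8 * 1024 * 1024;    // forced boundary after M content bytes

    // Returns the positions at which chunk boundaries are placed.
    static List<Integer> chunkBoundaries(byte[] data) {
        List<Integer> boundaries = new ArrayList<>();
        int chunkStart = 0;
        // No marker can be placed before chunkStart + MIN_CHUNK (the skipped offset).
        int pos = Math.min(chunkStart + MIN_CHUNK, data.length);
        while (pos < data.length) {
            long fp = fingerprint(data, pos);
            boolean atMax = pos - chunkStart >= MAX_CHUNK;
            if ((fp & MARKER_MASK) == 0 || atMax) {
                boundaries.add(pos);                  // place a chunk boundary here
                chunkStart = pos;
                pos = Math.min(chunkStart + MIN_CHUNK, data.length);
            } else {
                pos++;                                // slide the window by one byte
            }
        }
        boundaries.add(data.length);                  // last chunk ends at end of file
        return boundaries;
    }

    // Hash of the window ending at 'end' (exclusive); a stand-in for the Rabin
    // fingerprint, recomputed per position here for brevity.
    static long fingerprint(byte[] data, int end) {
        long h = 0;
        for (int i = Math.max(0, end - WINDOW); i < end; i++) {
            h = h * 31 + (data[i] & 0xff);
        }
        return h;
    }
}
```

Recomputing the hash over the entire window at each position keeps the sketch short; an actual implementation would update a rolling Rabin fingerprint incrementally as the window advances by one byte.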
An important design decision is whether to perform chunking during the creation
of the input or when the input is read by the map task. We chose the former because
the cost of chunking can be amortized when chunking and producing the input data
are done in parallel. This is relevant in cases where the generation of input data is not
limited by the storage throughput.
To parallelize the chunking process on multicore machines, our implementation
uses multiple threads, each of which starts the search for the marker at a different
position. The markers that each thread finds cannot be used immediately to define the
chunk boundaries, since some of them might have to be skipped due to the minimum chunk size constraint.
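A minimal sketch of this two-phase scheme is given below, reusing the fingerprint and size constants from the ContentChunker sketch above. The structure, in which worker threads collect candidate markers over disjoint regions and a sequential pass then applies the minimum and maximum chunk-size rules, is assumed here for illustration and is not the Inc-HDFS implementation.

```java
// Sketch of parallel marker search followed by a sequential boundary-selection pass.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelChunker {
    static List<Integer> chunkBoundaries(byte[] data, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<List<Integer>>> parts = new ArrayList<>();
        int stride = (data.length + threads - 1) / threads;

        // Phase 1: each thread records every candidate marker in its own region.
        for (int t = 0; t < threads; t++) {
            final int begin = t * stride;
            final int end = Math.min(begin + stride, data.length);
            parts.add(pool.submit(() -> {
                List<Integer> found = new ArrayList<>();
                for (int pos = begin; pos < end; pos++) {
                    long fp = ContentChunker.fingerprint(data, pos);
                    if ((fp & ContentChunker.MARKER_MASK) == 0) {
                        found.add(pos);
                    }
                }
                return found;
            }));
        }
        List<Integer> candidates = new ArrayList<>();
        for (Future<List<Integer>> part : parts) {
            candidates.addAll(part.get());
        }
        pool.shutdown();
        Collections.sort(candidates);

        // Phase 2: a sequential pass drops candidates that fall inside the skipped
        // offset after the previous boundary and forces a boundary whenever the
        // maximum chunk size is exceeded.
        List<Integer> boundaries = new ArrayList<>();
        int chunkStart = 0;
        for (int pos : candidates) {
            if (pos - chunkStart < ContentChunker.MIN_CHUNK) {
                continue; // candidate lies inside the skipped offset
            }
            while (pos - chunkStart > ContentChunker.MAX_CHUNK) {
                chunkStart += ContentChunker.MAX_CHUNK; // forced split
                boundaries.add(chunkStart);
            }
            boundaries.add(pos);
            chunkStart = pos;
        }
        while (data.length - chunkStart > ContentChunker.MAX_CHUNK) {
            chunkStart += ContentChunker.MAX_CHUNK;     // forced split in the final stretch
            boundaries.add(chunkStart);
        }
        boundaries.add(data.length);
        return boundaries;
    }
}
```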