Incremental MapReduce Computations - Large Scale and Big Data: Processing and Management - page 133


[Figure: (a) HDFS splits the input file into Chunk-1, Chunk-2, and Chunk-3 by fixed-size chunking. (b) Inc-HDFS splits the input file at content-based markers into Chunk-1, Chunk-2, and Chunk-3 by content-based chunking. (c) After a data modification in chunk-2, Inc-HDFS keeps Chunk-1 and Chunk-3 and replaces Chunk-2 with a new chunk, Chunk-4.]

FIGURE 4.2 Chunking strategies in HDFS and Inc-HDFS. (a) Fixed-size chunking in HDFS. (b) Content-based chunking in Inc-HDFS. (c) Example of stable partitioning.

chunk size. Therefore, we collect the markers in a centralized list, and scan the list to determine which markers are skipped; the remaining ones form the chunk boundaries. Our experimental evaluation (Section 4.6.4) highlights the performance gains of this optimization, showing that it is instrumental in keeping the performance of Inc-HDFS close to that of HDFS.
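The scan over the centralized marker list can be sketched as follows. This is only an illustration: the text leading up to this paragraph is cut off, so the skipping rule used here (drop any marker that would create a chunk shorter than a minimum size) is an assumption, and the names `select_boundaries` and `min_chunk_size` are hypothetical rather than taken from Inc-HDFS.

```python
from typing import Iterable, List


def select_boundaries(markers: Iterable[int], min_chunk_size: int) -> List[int]:
    """Scan marker offsets in order and keep those that become chunk boundaries.

    Assumed rule (not confirmed by the text): a marker is skipped when it lies
    fewer than min_chunk_size bytes after the previously kept boundary.
    """
    boundaries: List[int] = []
    last = 0  # the start of the file acts as the first boundary
    for pos in sorted(markers):
        if pos - last >= min_chunk_size:
            boundaries.append(pos)  # marker kept: it ends a chunk
            last = pos
        # otherwise the marker is skipped and the current chunk keeps growing
    return boundaries


# Marker offsets as they might be reported by parallel scanners.
markers = [10, 25, 90, 100, 260]
print(select_boundaries(markers, min_chunk_size=64))  # prints [90, 260]
```

Because the decision for each marker depends on the last boundary that was kept, the scan is sequential, but it touches only the list of marker offsets and not the data itself.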

4.4 INCREMENTAL MapReduce

This section presents our design for incremental MapReduce computations. We describe the map and reduce phases separately.

Incremental map. For the map phase, the main challenges have already been addressed by Inc-HDFS, which partitions data so that the input to map tasks is stable and the average granularity of that input can be controlled. In particular, this granularity can be adjusted by changing how likely it is to find a marker. It should be set to balance two costs: the overhead of scheduling many map tasks when the average chunk size is small, and the cost of recomputing a large map task after only a small subset of its input changes when the average chunk size is large.
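The effect of marker likelihood on granularity can be illustrated with a small content-based chunker. This is a minimal sketch under stated assumptions, not the Inc-HDFS implementation: `zlib.crc32` over a fixed window stands in for whatever fingerprint function Inc-HDFS uses, and `content_chunks`, `divisor`, and `window` are hypothetical names. A position is treated as a marker when the window fingerprint is divisible by `divisor`, so a larger `divisor` makes markers rarer and chunks larger on average.

```python
import random
import zlib
from typing import List


def content_chunks(data: bytes, divisor: int, window: int = 16) -> List[bytes]:
    """Split data after every position whose window fingerprint hits a marker."""
    chunks: List[bytes] = []
    start = 0
    for i in range(window - 1, len(data)):
        # Fingerprint of the last `window` bytes ending at position i.
        fingerprint = zlib.crc32(data[i - window + 1 : i + 1])
        if fingerprint % divisor == 0:  # marker found: close the chunk here
            chunks.append(data[start : i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes form the last chunk
    return chunks


rng = random.Random(0)
data = bytes(rng.randrange(256) for _ in range(20000))

# Rarer markers (larger divisor) give fewer, larger chunks on average.
for divisor in (256, 1024):
    chunks = content_chunks(data, divisor)
    print(divisor, len(chunks), len(data) // len(chunks))

# Stability: insert one byte in the middle and count chunks shared with the original.
edited = data[:10000] + b"X" + data[10000:]
old, new = content_chunks(data, 256), content_chunks(edited, 256)
print(len(set(old) & set(new)), "of", len(old), "chunks unchanged")
```

Since boundaries depend only on nearby content, every chunk that ends before the edit point is reproduced exactly, so map tasks over those chunks would not need to be rerun.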
