FIGURE 4.2 Chunking strategies in HDFS and Inc-HDFS. (a) Fixed-size chunking in HDFS. (b) Content-based chunking in Inc-HDFS, where content-based markers in the input file define the chunk boundaries. (c) Example of stable partitioning: a data modification in chunk-2 produces a new chunk-4, while chunk-1 and chunk-3 are unchanged.
chunk size. Therefore, we collect the markers in a centralized list, and scan the list to
determine which markers are skipped; the remaining ones form the chunk boundaries.
Our experimental evaluation (Section 4.6.4) highlights the performance gains
of this optimization, showing that it is instrumental in keeping the performance of
Inc-HDFS close to that of HDFS.
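The scan over the centralized marker list can be sketched as follows. This is a minimal illustration, not the Inc-HDFS implementation: it assumes that a marker is skipped when it would create a chunk smaller than a minimum size, and that a boundary is forced when no usable marker appears within a maximum size. The function name `select_boundaries` and the parameters `min_size` and `max_size` are hypothetical.

```python
def select_boundaries(markers, file_len, min_size, max_size):
    """Scan candidate marker offsets and return the chunk end offsets.

    Assumed rule (not taken from the source): a marker is skipped if the
    chunk it would close is shorter than min_size, and a boundary is
    forced whenever a chunk would otherwise exceed max_size.
    """
    boundaries = []
    last = 0  # offset at which the current chunk starts
    for m in sorted(markers):
        if m <= last or m > file_len:
            continue  # marker is outside the region still to be chunked
        # No usable marker within max_size: force fixed-size cuts.
        while m - last > max_size:
            last += max_size
            boundaries.append(last)
        if m - last >= min_size:
            boundaries.append(m)  # marker kept: it becomes a boundary
            last = m
        # Otherwise the marker is skipped (chunk would be too small).
    # Close the tail of the file, still honoring max_size.
    while file_len - last > max_size:
        last += max_size
        boundaries.append(last)
    if last < file_len:
        boundaries.append(file_len)
    return boundaries


# Markers at 12 and 45 are skipped (too close to the previous boundary);
# a cut is forced at 90 because the gap between 40 and 130 exceeds max_size.
print(select_boundaries([10, 12, 40, 45, 130], 150, min_size=8, max_size=50))
# -> [10, 40, 90, 130, 150]
```

Because the decision for each marker depends on the previously accepted boundary, the scan is inherently sequential, which is consistent with gathering the markers into one list before deciding which to keep.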
4.4 INCREMENTAL MapReduce
This section presents our design for incremental MapReduce computations. We split
the presentation by describing the map and reduce phases separately.
Incremental map. For the map phase, the main challenges have already been
addressed by Inc-HDFS, which partitions data so that the input to map tasks is
stable and the average granularity of that input can be controlled. In particular,
this granularity can be adjusted by changing how likely it is to find a marker, and
it should be set to balance two costs: the overhead of scheduling many map tasks
when the average chunk size is small, and the cost of recomputing a large map task
after a small subset of its input changes when the average chunk size is large.
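The relationship between marker probability, chunk granularity, and stability can be illustrated with a small sketch. It is written under stated assumptions and is not the Inc-HDFS code: a marker is declared wherever the CRC-32 of the last `window` bytes is divisible by `divisor`, so a marker occurs with probability of roughly 1/`divisor` per position and the average chunk is roughly `divisor` bytes. The hash is recomputed for every window for clarity, whereas a practical chunker would use a rolling fingerprint. The names `chunk_boundaries`, `split_chunks`, `window`, and `divisor` are hypothetical.

```python
import random
import zlib


def chunk_boundaries(data, window=16, divisor=64):
    """Return chunk end offsets chosen from the content itself.

    A position is a marker when the CRC-32 of the preceding `window`
    bytes is divisible by `divisor` (an illustrative rule, assumed here).
    """
    cuts = []
    for end in range(window, len(data) + 1):
        if zlib.crc32(data[end - window:end]) % divisor == 0:
            cuts.append(end)
    if not cuts or cuts[-1] != len(data):
        cuts.append(len(data))  # the last chunk always ends at end of input
    return cuts


def split_chunks(data, window=16, divisor=64):
    """Split data into chunks at the content-based boundaries."""
    chunks, start = [], 0
    for end in chunk_boundaries(data, window, divisor):
        chunks.append(data[start:end])
        start = end
    return chunks


if __name__ == "__main__":
    rng = random.Random(0)
    data = bytes(rng.randrange(256) for _ in range(8192))

    # A larger divisor makes markers rarer, so chunks get larger.
    for divisor in (32, 256):
        print(divisor, len(split_chunks(data, divisor=divisor)))

    # Insert a few bytes in the middle and count how many chunks are reused.
    edited = data[:4000] + b"new bytes" + data[4000:]
    old, new = split_chunks(data), split_chunks(edited)
    print(len(set(new) & set(old)), "of", len(new), "chunks unchanged")
```

With fixed-size chunking, the same insertion would shift every later chunk boundary, so all map tasks after the edit would see different input. Here, only the chunks that overlap the edited region (plus one window) change, so only the corresponding map tasks need to be re-executed.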