Incremental MapReduce Computations - Large Scale and Big Data: Processing and Management - page 134

Database Reference

In-Depth Information

Therefore, the main job of map tasks in Incoop is to implement task-level memoiza-

tion. To do this, after a map task runs, we store its results persistently (instead of

discarding them after the job execution) and insert a corresponding reference to the

result in the memoization server.

During incremental runs, map tasks query the memoization server to determine

if their output has already been computed. If so, they output the location of the

memoized result, and conclude. Figure 4.3 illustrates this process: part (a) describes

the initial run and part (b) describes the incremental run where chunk 2 is modi-

fied (and replaced by chunk 4) and the map tasks for chunks 1 and 3 can reuse the

memoized results.

Incremental reduce. The reduce task processes the output of the map phase:

each reduce task has an associated key k , collects all the key-value pairs generated

by all map tasks for k , and applies the reduce function. For efficiency, we apply two

levels of memoization in this case. First, we memoize the inputs and outputs of the

entire reduce task to try to reuse these results in a single step. Second, we break down

the reduce phase into a contraction phase followed by a smaller invocation of the

reduce function to address the stability issues we discussed.

The first level of memoization is very similar to that of map tasks: the memoiza-

tion server maintains a mapping from a hash of the input to the location of the result

of the reduce task. A minor difference is that a reduce task receives input from

several map tasks, and as such the key of that mapping is the concatenation of the

collision-resistant hashes from all these outputs. For the reduce task to compute this

key, instead of immediately copying the output from all map tasks, it fetches the

hashes only to determine if the reduce task can be skipped entirely. Only if this is

not the case the data is transferred from map to reduce tasks.

As we mentioned, this first level has the limitation that small changes in the input

cause the entire reduce task to be re-executed, which can result in work that is linear

(a)

(b)

Chunk-1

Chunk-2

Chunk-3

Chunk-1

Chunk-4

Chunk-3

Execute job

Execute job

Job

tracker

Job

tracker

Chunk-1

Chunk-2

Chunk-3

Chunk-1

Chunk-4

Chunk-3

Map-1

Map-2

Map-3

Map-1

Map-2

Map-3

FIGURE 4.3

Incremental map tasks. (a) Initial run. (b) Incremental run.

Next Page

Large Scale and Big Data: Processing and Management

Search WWH ::

Custom Search

Home