computed results and executing tasks on any available machine to prevent straggling
effects. At runtime, instances of incremental Map tasks take advantage of
previously stored results by querying the memoization server. If they find that
the result has already been computed, they fetch it from the location of
their memoized output and conclude. Similarly, the results of a Reduce task are
remembered by storing them persistently and locally, and a mapping from a
collision-resistant hash of the input to the location of the output is inserted into
the memoization server.
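As a minimal sketch of this memoization protocol on the Map side (assuming a simple key-value memoization server and a node-local result store; memo_server, persist_locally, and fetch_from are illustrative names, not the system's actual API):

    import hashlib

    memo_server = {}    # collision-resistant hash of input -> location of stored output
    _local_store = {}   # stand-in for persistent, node-local result storage

    def content_hash(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def persist_locally(output):
        loc = "node-local://result-%d" % len(_local_store)
        _local_store[loc] = output
        return loc

    def fetch_from(loc):
        return _local_store[loc]

    def run_map_task(input_split: bytes, map_fn):
        key = content_hash(input_split)
        loc = memo_server.get(key)
        if loc is not None:
            return fetch_from(loc)   # already computed: fetch the memoized output and conclude
        output = map_fn(input_split)
        memo_server[key] = persist_locally(output)   # remember the result for later runs
        return output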
Since a Reduce task receives input from n Map tasks, the key stored in the
memoization server consists of the hashes of the outputs of all n Map tasks that
collectively form the input to the Reduce task. Therefore, when executing a Reduce
task, instead of immediately copying the output of the Map tasks, the Reduce task
asks the Map tasks for their respective output hashes to determine whether the
Reduce task has already been computed in a previous run. If so, that output is
fetched directly from the location stored in the memoization server, which avoids
re-executing the task.
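Continuing the sketch above, the Reduce-side check could look roughly as follows; the MapTask handle with output_hash() and copy_output() methods is an assumed interface used only for illustration:

    def reduce_memo_key(map_output_hashes):
        # The memoization key is derived from the hashes of all n Map outputs
        # that collectively form the Reduce task's input.
        return hashlib.sha256("".join(map_output_hashes).encode()).hexdigest()

    def run_reduce_task(map_tasks, reduce_fn):
        # Ask each Map task for its output hash before copying any data.
        hashes = [t.output_hash() for t in map_tasks]
        key = reduce_memo_key(hashes)
        loc = memo_server.get(key)
        if loc is not None:
            return fetch_from(loc)                      # reuse the memoized Reduce output
        inputs = [t.copy_output() for t in map_tasks]   # only now copy the Map outputs
        output = reduce_fn(inputs)
        memo_server[key] = persist_locally(output)
        return output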
The M3 system [64] has been proposed to support answering continuous queries
over streams of data while bypassing HDFS, so that data is processed only through
a main-memory data path and disk access is avoided entirely. In this approach,
Mappers and Reducers never terminate; there is only one MapReduce job per query
operator, and it executes continuously. In M3, query processing is incremental:
only the new input is processed, and the change in the query answer is represented
by three sets of inserted (+ve), deleted (-ve), and updated (u) tuples. The query
issuer receives as output a stream that represents the deltas (incremental changes)
to the answer. Whenever an input tuple is received, it is transformed into a modify
operation (+ve, -ve, or u) that is propagated through the query execution pipeline,
producing the corresponding set of modify operations in the answer. Supporting
incremental query evaluation requires that some intermediate state be kept at the
various operators of the query execution pipeline. Therefore, Mappers and Reducers
run continuously without termination, and hence can maintain main-memory state
throughout the execution.
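As a rough illustration of how such a continuously running operator could maintain state and emit deltas, consider a grouped count; the operator, tuple shape, and delta encoding below are assumptions made for the sketch, not M3's actual interfaces:

    from collections import defaultdict

    class IncrementalCount:
        """A continuously running operator that keeps per-group counts in main memory."""
        def __init__(self):
            self.state = defaultdict(int)

        def apply(self, op, key):
            # op is '+ve' (insert) or '-ve' (delete); the return value is the
            # modify operation propagated further down the pipeline.
            old = self.state[key]
            self.state[key] += 1 if op == '+ve' else -1
            new = self.state[key]
            if old == 0:
                return ('+ve', (key, new))   # group appears in the answer
            if new == 0:
                return ('-ve', (key, old))   # group disappears from the answer
            return ('u', (key, old, new))    # existing group's count is updated

    counter = IncrementalCount()
    print(counter.apply('+ve', 'a'))   # ('+ve', ('a', 1))
    print(counter.apply('+ve', 'a'))   # ('u', ('a', 1, 2))
    print(counter.apply('-ve', 'a'))   # ('u', ('a', 2, 1))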
In contrast to splitting the input data based on its size, as in Hadoop's Input Split
functionality, M3 splits the streamed data based on arrival rates: the Rate Split
layer, between the main-memory buffers and the Mappers, is responsible for
balancing the stream rates among the Mappers. This layer periodically receives rate
statistics from the Mappers and redistributes the processing load among them
accordingly. For instance, a fast stream that would overflow one Mapper should be
distributed among two or more Mappers, whereas a group of slow streams that would
underflow their corresponding Mappers should be combined to feed into a single
Mapper.
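The redistribution decision can be pictured with a simple greedy assignment; the per-Mapper capacity and the example rate statistics below are assumed values used only to illustrate the idea:

    from collections import defaultdict

    def split_by_rate(stream_rates, mapper_capacity):
        """stream_rates: {stream_id: tuples/sec}; returns {mapper_index: [(stream_id, share)]}."""
        assignments = defaultdict(list)
        mapper, load = 0, 0.0
        for sid, rate in sorted(stream_rates.items(), key=lambda kv: -kv[1]):
            remaining = rate
            while remaining > 0:
                share = min(remaining, mapper_capacity - load)
                assignments[mapper].append((sid, share))   # a fast stream spills onto extra Mappers
                remaining -= share
                load += share
                if load >= mapper_capacity:                # this Mapper is full; move to the next
                    mapper, load = mapper + 1, 0.0
        return dict(assignments)

    # One fast stream is split across Mappers; the slow streams share a single Mapper.
    print(split_by_rate({'s1': 150.0, 's2': 30.0, 's3': 20.0}, mapper_capacity=100.0))
    # {0: [('s1', 100.0)], 1: [('s1', 50.0), ('s2', 30.0), ('s3', 20.0)]}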
To support fault tolerance, input data is replicated inside the main-memory buffers,
and an input split is not overwritten until the corresponding Mapper commits.
When a Mapper fails, it re-reads its input split from any of the replicas inside
the buffers. A Mapper writes its intermediate key-value pairs to its own main
memory, and does not overwrite a set of key-value pairs until the corresponding
Reducer commits. When a Reducer fails, it re-reads its corresponding sets of
intermediate key-value pairs from the Mappers.
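A sketch of this buffer discipline, with the replica count and method names chosen purely for illustration:

    class ReplicatedInputBuffer:
        """Keeps each input split in several main-memory copies until the Mapper commits."""
        def __init__(self, replicas=2):
            self.replicas = replicas
            self.splits = {}   # split_id -> {'copies': [...], 'committed': bool}

        def write(self, split_id, data):
            self.splits[split_id] = {'copies': [data] * self.replicas, 'committed': False}

        def reread(self, split_id):
            # A restarted Mapper can re-read its split from any surviving replica.
            return self.splits[split_id]['copies'][0]

        def commit(self, split_id):
            self.splits[split_id]['committed'] = True

        def can_overwrite(self, split_id):
            # The split may only be overwritten once the corresponding Mapper has committed.
            return self.splits[split_id]['committed']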