computed results and executing tasks on any available machine to prevent straggling
effects. At runtime, instances of incremental Map tasks take advantage of
previously stored results by querying the memoization server. If they find that
the result has already been computed, they fetch it from the location of
their memoized output and conclude. Similarly, the results of a Reduce task are
remembered by storing them persistently and locally, and a mapping from a
collision-resistant hash of the input to the location of the output is inserted into
the memoization server.
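As a minimal sketch of this memoization protocol on the Map side (assuming a simple key-value memoization server and a node-local result store; memo_server, persist_locally, and fetch_from are illustrative names, not the system's actual API):

    import hashlib

    memo_server = {}    # collision-resistant hash of input -> location of stored output
    _local_store = {}   # stand-in for persistent, node-local result storage

    def content_hash(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def persist_locally(output):
        loc = "node-local://result-%d" % len(_local_store)
        _local_store[loc] = output
        return loc

    def fetch_from(loc):
        return _local_store[loc]

    def run_map_task(input_split: bytes, map_fn):
        key = content_hash(input_split)
        loc = memo_server.get(key)
        if loc is not None:
            return fetch_from(loc)   # already computed: fetch the memoized output and conclude
        output = map_fn(input_split)
        memo_server[key] = persist_locally(output)   # remember the result for later runs
        return output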
Since a Reduce task receives input from n Map tasks, the key stored in the
memoization server consists of the hashes of the outputs of all n Map tasks that
collectively form the input to the Reduce task. Therefore, when executing a Reduce
task, instead of immediately copying the output of the Map tasks, the Reduce task
asks the Map tasks for their respective output hashes to determine whether the
Reduce task has already been computed in a previous run. If so, that output is
fetched directly from the location stored in the memoization server, which avoids
re-executing the task.
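Continuing the sketch above, the Reduce-side check could look roughly as follows; the MapTask handle with output_hash() and copy_output() methods is an assumed interface used only for illustration:

    def reduce_memo_key(map_output_hashes):
        # The memoization key is derived from the hashes of all n Map outputs
        # that collectively form the Reduce task's input.
        return hashlib.sha256("".join(map_output_hashes).encode()).hexdigest()

    def run_reduce_task(map_tasks, reduce_fn):
        # Ask each Map task for its output hash before copying any data.
        hashes = [t.output_hash() for t in map_tasks]
        key = reduce_memo_key(hashes)
        loc = memo_server.get(key)
        if loc is not None:
            return fetch_from(loc)                      # reuse the memoized Reduce output
        inputs = [t.copy_output() for t in map_tasks]   # only now copy the Map outputs
        output = reduce_fn(inputs)
        memo_server[key] = persist_locally(output)
        return output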
The M3 system [64] has been proposed to support answering continuous queries
over streams of data while bypassing HDFS, so that data is processed only through
a main-memory data path and disk access is avoided entirely. In this approach,
Mappers and Reducers never terminate; there is only one MapReduce job per query
operator, and it executes continuously. In M3, query processing is incremental:
only the new input is processed, and the change in the query answer is represented
by three sets of inserted (+ve), deleted (-ve), and updated (u) tuples. The query
issuer receives as output a stream that represents the deltas (incremental changes)
to the answer. Whenever an input tuple is received, it is transformed into a modify
operation (+ve, -ve, or u) that is propagated through the query execution pipeline,
producing the corresponding set of modify operations in the answer. Supporting
incremental query evaluation requires that some intermediate state be kept at the
various operators of the query execution pipeline. Therefore, Mappers and Reducers
run continuously without termination, and hence can maintain main-memory state
throughout the execution.
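As a rough illustration of how such a continuously running operator could maintain state and emit deltas, consider a grouped count; the operator, tuple shape, and delta encoding below are assumptions made for the sketch, not M3's actual interfaces:

    from collections import defaultdict

    class IncrementalCount:
        """A continuously running operator that keeps per-group counts in main memory."""
        def __init__(self):
            self.state = defaultdict(int)

        def apply(self, op, key):
            # op is '+ve' (insert) or '-ve' (delete); the return value is the
            # modify operation propagated further down the pipeline.
            old = self.state[key]
            self.state[key] += 1 if op == '+ve' else -1
            new = self.state[key]
            if old == 0:
                return ('+ve', (key, new))   # group appears in the answer
            if new == 0:
                return ('-ve', (key, old))   # group disappears from the answer
            return ('u', (key, old, new))    # existing group's count is updated

    counter = IncrementalCount()
    print(counter.apply('+ve', 'a'))   # ('+ve', ('a', 1))
    print(counter.apply('+ve', 'a'))   # ('u', ('a', 1, 2))
    print(counter.apply('-ve', 'a'))   # ('u', ('a', 2, 1))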
In contrast to splitting the input data based on its size, as in Hadoop's Input Split
functionality, M3 splits the streamed data based on arrival rates: the Rate Split
layer, between the main-memory buffers and the Mappers, is responsible for
balancing the stream rates among the Mappers. This layer periodically receives rate
statistics from the Mappers and redistributes the processing load among them
accordingly. For instance, a fast stream that would overflow one Mapper should be
distributed among two or more Mappers, whereas a group of slow streams that would
underflow their corresponding Mappers should be combined to feed into a single
Mapper.
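The redistribution decision can be pictured with a simple greedy assignment; the per-Mapper capacity and the example rate statistics below are assumed values used only to illustrate the idea:

    from collections import defaultdict

    def split_by_rate(stream_rates, mapper_capacity):
        """stream_rates: {stream_id: tuples/sec}; returns {mapper_index: [(stream_id, share)]}."""
        assignments = defaultdict(list)
        mapper, load = 0, 0.0
        for sid, rate in sorted(stream_rates.items(), key=lambda kv: -kv[1]):
            remaining = rate
            while remaining > 0:
                share = min(remaining, mapper_capacity - load)
                assignments[mapper].append((sid, share))   # a fast stream spills onto extra Mappers
                remaining -= share
                load += share
                if load >= mapper_capacity:                # this Mapper is full; move to the next
                    mapper, load = mapper + 1, 0.0
        return dict(assignments)

    # One fast stream is split across Mappers; the slow streams share a single Mapper.
    print(split_by_rate({'s1': 150.0, 's2': 30.0, 's3': 20.0}, mapper_capacity=100.0))
    # {0: [('s1', 100.0)], 1: [('s1', 50.0), ('s2', 30.0), ('s3', 20.0)]}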
To support fault tolerance, input data is replicated inside the main-memory buffers,
and an input split is not overwritten until the corresponding Mapper commits.
When a Mapper fails, it re-reads its input split from any of the replicas inside
the buffers. A Mapper writes its intermediate key-value pairs to its own main
memory, and does not overwrite a set of key-value pairs until the corresponding
Reducer commits. When a Reducer fails, it re-reads its corresponding sets of
intermediate key-value pairs from the Mappers.
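A sketch of this buffer discipline, with the replica count and method names chosen purely for illustration:

    class ReplicatedInputBuffer:
        """Keeps each input split in several main-memory copies until the Mapper commits."""
        def __init__(self, replicas=2):
            self.replicas = replicas
            self.splits = {}   # split_id -> {'copies': [...], 'committed': bool}

        def write(self, split_id, data):
            self.splits[split_id] = {'copies': [data] * self.replicas, 'committed': False}

        def reread(self, split_id):
            # A restarted Mapper can re-read its split from any surviving replica.
            return self.splits[split_id]['copies'][0]

        def commit(self, split_id):
            self.splits[split_id]['committed'] = True

        def can_overwrite(self, split_id):
            # The split may only be overwritten once the corresponding Mapper has committed.
            return self.splits[split_id]['committed']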