Database Reference
In-Depth Information
The DEDUCE system [ 166 ] has been presented as a middleware that attempts
to combine real-time stream processing with the capabilities of a large scale data
analysis framework like MapReduce. In particular, it extends the IBM's System S
stream processing engine and augments its capabilities with those of the MapReduce
framework. In this approach, the input data set to the MapReduce operator can
be either pre-specified at compilation time or could be provided at runtime as
a punctuated list of files or directories. Once the input data is available, the
MapReduce operator spawns a MapReduce job and produces a list of punctuated
list of files or directories, which point to the output data. Therefore, a MapReduce
operator can potentially spawn multiple MapReduce jobs over the application
lifespan but such jobs are spawned only when the preceding job (if any) has
completed its execution. Hence, multiple jobs can be cascaded together to create a
data-flow of MapReduce operators where the output from the MapReduce operators
can be read to provide updates to the stream processing operators.
System Optimizations
Several studies have been conducted to evaluate the performance characteristics of
the MapReduce framework. For example, Gu and Grossman [ 141 ] have reported
the following lessons which they have learned from their experiments with the
MapReduce framework:
￿
The importance of data locality . Locality is a key factor especially when relying
on inexpensive commodity hardware.
￿
Load balancing and the importance of identifying hot spots . With poor load
balancing, the entire system can be waiting for a single node. It is important
to eliminate any “hot spots” which can be caused by data access (accessing data
from a single node) or network I/O (transferring data into or out of a single node).
￿
Fault tolerance comes with a price . In some cases, fault tolerance introduces
extra overhead in order to replicate the intermediate results. For example, in the
cases of running on small to medium sized clusters, it might be reasonable to
favor performance and re-run any failed intermediate task when necessary.
￿
Streams are important . Streaming is important in order to reduce the total running
time of MapReduce jobs.
Jiang et al. [ 155 ] have conducted an in-depth performance study of MapReduce
using its open source implementation, Hadoop. As an outcome of this study,
they identified some factors that can have significant performance impact on the
MapReduce framework. These factors are described as follows:
￿
Although MapReduce is independent of the underlying storage system, it still
requires the storage system to provide efficient I/O modes for scanning data. The
experiments of the study on HDFS show that direct I/O outperforms streaming
I/O by 10-15 %.
Search WWH ::




Custom Search