Database Reference
In-Depth Information
The DEDUCE system [84] has been presented as a middleware that attempts to
combine real-time stream processing with the capabilities of a large-scale data analysis
framework like MapReduce. In particular, it extends the IBM's System S stream process-
ing engine and augments its capabilities with those of the MapReduce framework. In
this approach, the input data set to the MapReduce operator can be either prespecified at
compilation time or could be provided at runtime as a punctuated list of files or directo-
ries. Once the input data is available, the MapReduce operator spawns a MapReduce job
and produces a list of punctuated list of files or directories, which point to the output data.
Therefore, a MapReduce operator can potentially spawn multiple MapReduce jobs over
the application lifespan but such jobs are spawned only when the preceding job (if any)
has completed its execution. Hence, multiple jobs can be cascaded together to create a
data-flow of MapReduce operators where the output from the MapReduce operators can
be read to provide updates to the stream processing operators.
2.3.7 s ystem o Ptimizations
Several studies have been conducted to evaluate the performance characteristics of
the MapReduce framework. For example, Gu and Grossman [61] have reported the
following lessons that they have learned from their experiments with the MapReduce
framework:
The importance of data locality . Locality is a key factor especially when
relying on inexpensive commodity hardware.
Load balancing and the importance of identifying hot spots . With poor
load balancing, the entire system can be waiting for a single node. It is
important to eliminate any “hot spots,” which can be caused by data access
(accessing data from a single node) or network I/O (transferring data into or
out of a single node).
Fault tolerance comes with a price . In some cases, fault tolerance introduces
extra overhead to replicate the intermediate results. For example, in the cases
of running on small to medium sized clusters, it might be reasonable to favor
performance and rerun any failed intermediate task when necessary.
Streams are important . Streaming is important to reduce the total running
time of MapReduce jobs.
Jiang et al. [77] have conducted an in-depth performance study of MapReduce
using its open-source implementation, Hadoop. As an outcome of this study,
they identified some factors that can have significant performance impact on the
MapReduce framework. These factors are described as follows:
Although MapReduce is independent of the underlying storage system, it
still requires the storage system to provide efficient I/O modes for scanning
data. The experiments of the study on HDFS show that direct I/O outper-
forms streaming I/O by 10%-15%.
The MapReduce can utilize three kinds of indices (range indices, block-
level indices, and database-indexed tables) in a straightforward way. The
Search WWH ::




Custom Search