Distributed Programming for the Cloud - Large Scale and Big Data: Processing and Management


The DEDUCE system [84] has been presented as a middleware that attempts to

combine real-time stream processing with the capabilities of a large-scale data analysis

framework like MapReduce. In particular, it extends IBM's System S stream processing

engine and augments its capabilities with those of the MapReduce framework. In

this approach, the input data set to the MapReduce operator can either be prespecified at

compilation time or be provided at runtime as a punctuated list of files or directories.

Once the input data is available, the MapReduce operator spawns a MapReduce job

and produces a punctuated list of files or directories that point to the output data.

Therefore, a MapReduce operator can potentially spawn multiple MapReduce jobs over

the application lifespan, but each job is spawned only when the preceding job (if any)

has completed its execution. Hence, multiple jobs can be cascaded together to create a

data flow of MapReduce operators, where the output from the MapReduce operators can

be read to provide updates to the stream processing operators.
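
The operator behavior described above can be illustrated with a small sketch. This is not the DEDUCE or System S API: the class `MapReduceOperator`, the marker `PUNCT`, and the helper `fake_job` are hypothetical names invented for illustration, and the "job" is a plain function call rather than a real MapReduce job. The sketch only shows the control flow: paths accumulate until a punctuation arrives, one blocking job runs per batch (so a job starts only after the preceding one has finished), and the output is again a punctuated list that a downstream operator can consume.

```python
from typing import Callable, List

PUNCT = object()  # hypothetical punctuation marker that closes one batch of paths


class MapReduceOperator:
    """Toy stand-in for a stream operator that wraps MapReduce jobs."""

    def __init__(self, run_job: Callable[[List[str]], List[str]]):
        self.run_job = run_job          # hypothetical blocking job launcher
        self.pending: List[str] = []    # paths received since the last punctuation
        self.jobs_run = 0

    def on_item(self, item) -> list:
        """Consume one stream item and return the output items it triggers."""
        if item is not PUNCT:
            self.pending.append(item)   # keep collecting input paths
            return []
        batch, self.pending = self.pending, []
        if not batch:
            return []                   # punctuation with no input: nothing to run
        # run_job blocks, so the next job cannot start before this one ends.
        outputs = self.run_job(batch)
        self.jobs_run += 1
        return list(outputs) + [PUNCT]  # output is itself a punctuated list


def fake_job(paths: List[str]) -> List[str]:
    """Placeholder for a real job: maps each input path to an output path."""
    return [p + ".out" for p in paths]


def run_stream(op: MapReduceOperator, stream) -> list:
    """Feed a whole stream through one operator and collect its output."""
    out = []
    for item in stream:
        out.extend(op.on_item(item))
    return out
```

Because the output has the same shape as the input, two operators can be cascaded by passing the result of `run_stream(op1, stream)` to `run_stream(op2, ...)`.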

2.3.7 System Optimizations

Several studies have been conducted to evaluate the performance characteristics of

the MapReduce framework. For example, Gu and Grossman [61] have reported the

following lessons that they have learned from their experiments with the MapReduce

framework:

• The importance of data locality. Locality is a key factor, especially when

relying on inexpensive commodity hardware.

• Load balancing and the importance of identifying hot spots. With poor

load balancing, the entire system can be waiting for a single node. It is

important to eliminate any "hot spots," which can be caused by data access

(accessing data from a single node) or network I/O (transferring data into or

out of a single node).

• Fault tolerance comes with a price. In some cases, fault tolerance introduces

extra overhead to replicate the intermediate results. For example, when

running on small- to medium-sized clusters, it might be reasonable to favor

performance and rerun any failed intermediate task when necessary.

• Streams are important. Streaming is important to reduce the total running

time of MapReduce jobs.
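
The load-balancing lesson above can be made concrete with a minimal sketch. It is not taken from Gu and Grossman's experiments: the function names, the modulo partitioning, and the threshold `factor` are assumptions chosen for illustration. The idea is that a parallel phase finishes only when its busiest node finishes, so the ratio of the maximum load to the mean load gives a rough measure of how long the rest of the system waits on a single node.

```python
from typing import Callable, List


def node_loads(records: List[int], num_nodes: int,
               key_fn: Callable[[int], int]) -> List[int]:
    """Count how many records land on each node under modulo partitioning."""
    loads = [0] * num_nodes
    for r in records:
        loads[key_fn(r) % num_nodes] += 1
    return loads


def imbalance(loads: List[int]) -> float:
    """Busiest node's load divided by the mean load (1.0 means perfectly even)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean


def hot_spots(loads: List[int], factor: float = 2.0) -> List[int]:
    """Indices of nodes whose load exceeds `factor` times the mean (assumed threshold)."""
    mean = sum(loads) / len(loads)
    return [i for i, load in enumerate(loads) if load > factor * mean]


records = list(range(1000))

# Even keys: every record has its own key, so the load spreads uniformly.
even = node_loads(records, 4, key_fn=lambda r: r)

# Skewed keys: 70% of the records share key 0 and pile up on one node.
skewed = node_loads(records, 4, key_fn=lambda r: 0 if r < 700 else r)

print(even, imbalance(even), hot_spots(even))
print(skewed, imbalance(skewed), hot_spots(skewed))
```

In the skewed case one node holds 775 of the 1000 records, so the phase takes roughly three times as long as the evenly partitioned one even though the total work is identical.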

Jiang et al. [77] have conducted an in-depth performance study of MapReduce

using its open-source implementation, Hadoop. As an outcome of this study,

they identified some factors that can have a significant performance impact on the

MapReduce framework. These factors are described as follows:

• Although MapReduce is independent of the underlying storage system, it

still requires the storage system to provide efficient I/O modes for scanning

data. The experiments of the study on HDFS show that direct I/O outperforms

streaming I/O by 10%-15%.

• MapReduce can utilize three kinds of indices (range indices, block-

level indices, and database-indexed tables) in a straightforward way. The
