12.5 DEDUCE
In general, MapReduce offers the capability to analyze several terabytes of
stored data, whereas stream-processing solutions can handle up to a few million
updates per second. However, a growing number of data-processing applications
need a solution that effectively and efficiently combines the benefits of
MapReduce and stream processing. DEDUCE [16] is a middleware that has been
designed to offer a unified abstraction and runtime for addressing the needs of
such modern data-processing applications. It attempts to combine real-time
stream processing with the capabilities of a massive data-analysis framework
like MapReduce by providing the following features:
Language Constructs: DEDUCE extends SPADE's data-flow composition
language to enable the specification and use of MapReduce jobs as data-
flow elements.
Reusable Modules: DEDUCE provides the capability to describe reusable
modules for implementing offline MapReduce tasks aimed at calibrating
analytic models.
Runtime Support: DEDUCE augments the System S runtime infrastructure
to support the execution and optimized deployment of map and reduce tasks.
Control Parameters: DEDUCE provides configuration parameters (e.g.,
update frequency and resource utilization hints) associated with the
MapReduce jobs that can be tweaked to perform end-to-end system optimi-
zations and shared resource management; a parameterized job description
is sketched after this list.
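The following is a minimal Python sketch of this idea rather than DEDUCE's actual SPADE-based syntax; all names here (e.g., MapReduceJobSpec, update_frequency_s, resource_hints) are invented for illustration. It shows a MapReduce job described as a data-flow element whose control parameters a runtime could inspect for scheduling and shared resource management.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical illustration only: DEDUCE expresses this in SPADE's data-flow
# language, not Python, and none of these names come from DEDUCE itself.

@dataclass
class MapReduceJobSpec:
    """A MapReduce job described as a data-flow element with tunable controls."""
    mapper: Callable                # map function applied to each input record
    reducer: Callable               # reduce function applied per key
    input_path: str                 # data at rest on the distributed file system
    output_path: str                # prespecified output location
    update_frequency_s: int = 3600  # how often the offline job is re-run
    resource_hints: Dict[str, int] = field(default_factory=dict)  # e.g., {"map_slots": 8}

# A runtime could read these parameters to decide when to re-run the job and
# how to share cluster resources with the surrounding streaming application.
job = MapReduceJobSpec(
    mapper=lambda key, value: [(word, 1) for word in value.split()],
    reducer=lambda key, counts: (key, sum(counts)),
    input_path="/hdfs/clickstream/",
    output_path="/hdfs/models/current/",
    update_frequency_s=900,
    resource_hints={"map_slots": 8, "reduce_slots": 2},
)
```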
The DEDUCE-specific language extensions to the SPADE language have been
designed to achieve the following goals:
1. To make it easy to specify MapReduce jobs
2. To support MapReduce jobs as composable data-flow elements
3. To provide the capability to create domain-specific collections of map
and reduce modules, as illustrated in the sketch after this list
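As a rough illustration of the third goal, again in Python rather than SPADE and with all names invented for this sketch, a registry of named map/reduce pairs lets a data-flow specification refer to domain-specific analytic logic by name instead of restating it inline:

```python
from typing import Callable, Dict, Tuple

# Hypothetical sketch of reusable map/reduce modules; none of these names
# come from DEDUCE itself.
MODULES: Dict[str, Tuple[Callable, Callable]] = {}

def register_module(name: str, mapper: Callable, reducer: Callable) -> None:
    """Make a map/reduce pair available under a stable, reusable name."""
    MODULES[name] = (mapper, reducer)

# A domain-specific module: count events per user, e.g., as input for
# recalibrating an analytic model offline.
register_module(
    "events_per_user",
    mapper=lambda record: [(record["user"], 1)],
    reducer=lambda user, counts: (user, sum(counts)),
)

# A data-flow specification would then reference the module by name.
mapper, reducer = MODULES["events_per_user"]
```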
In particular, the DEDUCE language extensions consist of two important com-
ponents: the DEDUCE Operator Toolkit and the module specification framework.
The DEDUCE Operator Toolkit contains the following operators:
MapReduce Operator: DEDUCE models a MapReduce job as a SPADE
operator. This approach simplifies the design of applications that combine
data at rest with data in motion. The input data set for a MapReduce job can
be specified either as a parameter to the operator or as a punctuated input
stream containing the locations of the directories or files to be processed.
The output of the MapReduce job is written to a prespecified location on the
distributed file system, and the location of this output data is optionally
available as a punctuated output stream from the MapReduce operator; the
sketch below illustrates this contract.
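The Python sketch below mimics that contract under stated assumptions; the real operator is expressed in SPADE and executed by the System S runtime, and names such as Punctuation, run_mapreduce, and mapreduce_operator are hypothetical. It buffers file locations until a punctuation arrives, runs a stubbed job, and forwards the output location downstream as a punctuated stream.

```python
from typing import Iterable, Iterator, List


class Punctuation:
    """Marks the end of a batch of input paths on the stream."""


def run_mapreduce(paths: List[str], output_dir: str) -> str:
    """Stand-in for submitting a MapReduce job over `paths`; returns the output location."""
    # A real implementation would launch map and reduce tasks on the cluster
    # and write results to the prespecified output directory.
    return output_dir


def mapreduce_operator(in_stream: Iterable, output_dir: str) -> Iterator:
    """Buffer file/directory locations until a punctuation arrives, then run
    the job and emit the output location as a punctuated output stream."""
    batch: List[str] = []
    for item in in_stream:
        if isinstance(item, Punctuation):
            location = run_mapreduce(batch, output_dir)
            yield location        # downstream operators can read the results here
            yield Punctuation()   # signal that this batch's output is complete
            batch = []
        else:
            batch.append(item)    # a directory or file to be processed


# Example: two input files followed by a punctuation trigger one MapReduce run.
stream = ["/hdfs/logs/part-0", "/hdfs/logs/part-1", Punctuation()]
for out in mapreduce_operator(stream, "/hdfs/mr-output/run-001"):
    if not isinstance(out, Punctuation):
        print("output available at:", out)
```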