Data Input and Output Operators: DEDUCE provides operators that read and write data to the underlying distributed file system while conforming to a specified data format. Besides being invoked directly by users, these operators are used by the map and reduce tasks to access data that is assumed to be formatted according to the inputFormat and outputFormat parameters specified as part of the MapReduce operator. They also hide the underlying distributed file system behind a common access interface.
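A minimal sketch of this idea follows. The class and method names (`CsvInputOperator`, `CsvOutputOperator`, `read`, `write`) are hypothetical, since the text does not show DEDUCE's actual API; the sketch only illustrates how format-aware operators can hide the file system behind a common interface, here assuming a CSV data format.

```python
import csv
import io


class CsvInputOperator:
    """Reads records from a file-like source, assuming CSV format
    (standing in for an inputFormat parameter)."""

    def __init__(self, source):
        self.source = source  # any file-like object: local file, DFS wrapper, ...

    def read(self):
        for row in csv.reader(self.source):
            yield row


class CsvOutputOperator:
    """Writes records to a file-like sink in the same format."""

    def __init__(self, sink):
        self.sink = sink

    def write(self, records):
        writer = csv.writer(self.sink)
        for record in records:
            writer.writerow(record)


# Usage: map and reduce tasks see only read()/write(), never the
# underlying file system.
sink = io.StringIO()
CsvOutputOperator(sink).write([["a", 1], ["b", 2]])
rows = list(CsvInputOperator(io.StringIO(sink.getvalue())).read())
```

Because both operators speak the same format, data written by one task can be read back by the next without either task knowing where the bytes actually live.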
DEDUCE also supports the creation of domain-specific MapReduce toolkits. A toolkit is a collection of domain-specific UBOPs (possibly implemented by a domain expert) and prewritten modules that may use the UBOPs specified in the toolkit. The UBOPs contained in a toolkit typically implement functional units that, for example, perform fast Fourier transforms on streamed digital signals. In other words, operators are the building blocks employed for specifying a map/reduce module. Prewritten modules can be used directly by DEDUCE developers to rapidly prototype and deploy domain-specific MapReduce jobs.
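The composition of operators into a map/reduce module can be sketched as follows. This is an illustrative toy, not DEDUCE's API: the toolkit dictionary, the operator names, and `run_module` are all hypothetical, showing only how a module can be assembled from reusable operators registered in a toolkit.

```python
from collections import defaultdict


def word_split(record):
    """A UBOP acting as a map operator: record -> (key, value) pairs."""
    for word in record.split():
        yield word, 1


def count(key, values):
    """A UBOP acting as a reduce operator: aggregates values per key."""
    return key, sum(values)


# The toolkit: a registry of named, reusable operators.
TOOLKIT = {"word_split": word_split, "count": count}


def run_module(map_name, reduce_name, records):
    """Assemble and run a tiny map/reduce module from toolkit operators."""
    map_op, reduce_op = TOOLKIT[map_name], TOOLKIT[reduce_name]
    groups = defaultdict(list)
    for record in records:
        for key, value in map_op(record):
            groups[key].append(value)
    return dict(reduce_op(k, vs) for k, vs in groups.items())


result = run_module("word_split", "count", ["a b a", "b"])
# result == {"a": 2, "b": 2}
```

A domain expert contributes the operators; a developer then only names which operators to wire into a module.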
12.6 STREAMCLOUD
The StreamCloud (SC) system [11] is a scalable and elastic distributed stream processing engine that builds on top of Borealis to provide transparent query parallelization. In particular, the SC compiler takes an abstract query and generates a parallel version that is deployed on a cluster that can consist of a large set of shared-nothing nodes. In this approach, logical data streams are split into multiple physical data substreams that flow in parallel, thus avoiding single-node bottlenecks. Communication across nodes is minimized and performed only to guarantee semantic transparency. SC performs content-aware stream splitting and encapsulates the parallelization logic within smart operators that ensure that the outcome of the parallel execution matches the output of the centralized execution. SC monitors its own activity and dynamically reacts to workload variations by redistributing load among its nodes and by provisioning or decommissioning nodes. Elastic resource management is performed on the fly with very low intrusiveness, making provisioning and decommissioning cost-effective.
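The essence of content-aware splitting can be sketched as follows. This is a hedged illustration in the spirit of SC's smart operators, not SC's actual implementation: hash-partitioning on a tuple's key field guarantees that all tuples sharing a key reach the same downstream instance, so a per-key aggregate computed in parallel matches the centralized result. The function and parameter names here are assumptions.

```python
import zlib


def split_stream(tuples, n_instances, key_of):
    """Route each tuple to one of n_instances substreams by key hash."""
    substreams = [[] for _ in range(n_instances)]
    for t in tuples:
        # Content-aware routing: the destination depends on the tuple's key,
        # so tuples with equal keys always share a substream.
        idx = zlib.crc32(str(key_of(t)).encode()) % n_instances
        substreams[idx].append(t)
    return substreams


stream = [("alice", 3), ("bob", 1), ("alice", 2)]
subs = split_stream(stream, 4, key_of=lambda t: t[0])
# Both "alice" tuples land in the same substream, so a per-key sum
# computed independently on each substream equals the centralized sum.
```

Routing on content, rather than round-robin, is what lets each downstream instance compute its share of a keyed aggregate without any cross-instance coordination.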
In SC, the two main factors for estimating the parallelization cost are the number of hops performed by each tuple and the communication fan-out of each node with the others. SC strikes a balance between communication and fan-out overhead: queries are split into subqueries, and each subquery is allocated to a set of SC instances grouped in a subcluster. Data flows from one subcluster to the next, until the output of the system. All instances of a subcluster run the same subquery, called the local subquery, on a fraction of the input data stream and produce a fraction of the output data stream. Communication between subclusters guarantees semantic transparency. Input queries are written in the Borealis query language and automatically parallelized by SC through its query compiler, which allows easy deployment of arbitrarily complex queries over large clusters with just a few steps. The