Data Input and Output Operators: DEDUCE provides operators that read and write data to the underlying distributed file system while conforming to a specified data format. Besides being invoked directly by users, these operators are used by the map and reduce tasks to access data that is assumed to be formatted according to the inputFormat and outputFormat parameters specified as part of the MapReduce operator. They also hide the underlying distributed file system behind a common access interface.
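A minimal sketch of this idea follows. The class and method names (`CsvInputOperator`, `CsvOutputOperator`, `read`, `write`) are hypothetical, since the text does not show DEDUCE's actual API; the sketch only illustrates how format-aware operators can hide the file system behind a common interface, here assuming a CSV data format.

```python
import csv
import io


class CsvInputOperator:
    """Reads records from a file-like source, assuming CSV format
    (standing in for an inputFormat parameter)."""

    def __init__(self, source):
        self.source = source  # any file-like object: local file, DFS wrapper, ...

    def read(self):
        for row in csv.reader(self.source):
            yield row


class CsvOutputOperator:
    """Writes records to a file-like sink in the same format."""

    def __init__(self, sink):
        self.sink = sink

    def write(self, records):
        writer = csv.writer(self.sink)
        for record in records:
            writer.writerow(record)


# Usage: map and reduce tasks see only read()/write(), never the
# underlying file system.
sink = io.StringIO()
CsvOutputOperator(sink).write([["a", 1], ["b", 2]])
rows = list(CsvInputOperator(io.StringIO(sink.getvalue())).read())
```

Because both operators speak the same format, data written by one task can be read back by the next without either task knowing where the bytes actually live.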
DEDUCE also supports the creation of domain-specific MapReduce toolkits. A toolkit is a collection of domain-specific UBOPs (possibly implemented by a domain expert) and prewritten modules that may use the UBOPs specified in the toolkit. The UBOPs contained in a toolkit typically implement functional units that, for example, perform fast Fourier transforms on streamed digital signals. In other words, operators are the building blocks employed for specifying a map/reduce module. Prewritten modules can be used directly by DEDUCE developers to rapidly prototype and deploy domain-specific MapReduce jobs.
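The composition of operators into a map/reduce module can be sketched as follows. This is an illustrative toy, not DEDUCE's API: the toolkit dictionary, the operator names, and `run_module` are all hypothetical, showing only how a module can be assembled from reusable operators registered in a toolkit.

```python
from collections import defaultdict


def word_split(record):
    """A UBOP acting as a map operator: record -> (key, value) pairs."""
    for word in record.split():
        yield word, 1


def count(key, values):
    """A UBOP acting as a reduce operator: aggregates values per key."""
    return key, sum(values)


# The toolkit: a registry of named, reusable operators.
TOOLKIT = {"word_split": word_split, "count": count}


def run_module(map_name, reduce_name, records):
    """Assemble and run a tiny map/reduce module from toolkit operators."""
    map_op, reduce_op = TOOLKIT[map_name], TOOLKIT[reduce_name]
    groups = defaultdict(list)
    for record in records:
        for key, value in map_op(record):
            groups[key].append(value)
    return dict(reduce_op(k, vs) for k, vs in groups.items())


result = run_module("word_split", "count", ["a b a", "b"])
# result == {"a": 2, "b": 2}
```

A domain expert contributes the operators; a developer then only names which operators to wire into a module.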
12.6 STREAMCLOUD
The StreamCloud (SC) system [11] is a scalable and elastic distributed stream processing engine that builds on top of Borealis to provide transparent query parallelization. In particular, the SC compiler takes an abstract query and generates a parallel version that is deployed on a cluster that can consist of a large set of shared-nothing nodes. In this approach, logical data streams are split into multiple physical data substreams that flow in parallel, thus avoiding single-node bottlenecks. Communication across nodes is minimized and performed only to guarantee semantic transparency. SC performs content-aware stream splitting and encapsulates the parallelization logic within smart operators that ensure that the outcome of the parallel execution matches the output of the centralized execution. SC monitors its own activity and dynamically reacts to workload variations by redistributing load among its nodes and by provisioning or decommissioning nodes. Elastic resource management is performed on the fly with very low intrusiveness, making provisioning and decommissioning cost-effective.
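The essence of content-aware splitting can be sketched as follows. This is a hedged illustration in the spirit of SC's smart operators, not SC's actual implementation: hash-partitioning on a tuple's key field guarantees that all tuples sharing a key reach the same downstream instance, so a per-key aggregate computed in parallel matches the centralized result. The function and parameter names here are assumptions.

```python
import zlib


def split_stream(tuples, n_instances, key_of):
    """Route each tuple to one of n_instances substreams by key hash."""
    substreams = [[] for _ in range(n_instances)]
    for t in tuples:
        # Content-aware routing: the destination depends on the tuple's key,
        # so tuples with equal keys always share a substream.
        idx = zlib.crc32(str(key_of(t)).encode()) % n_instances
        substreams[idx].append(t)
    return substreams


stream = [("alice", 3), ("bob", 1), ("alice", 2)]
subs = split_stream(stream, 4, key_of=lambda t: t[0])
# Both "alice" tuples land in the same substream, so a per-key sum
# computed independently on each substream equals the centralized sum.
```

Routing on content, rather than round-robin, is what lets each downstream instance compute its share of a keyed aggregate without any cross-instance coordination.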
In SC, the two main factors for estimating the parallelization cost are the number of hops performed by each tuple and the communication fan-out of each node with the others. SC strikes a balance between communication and fan-out overhead: queries are split into subqueries, and each subquery is allocated to a set of SC instances grouped in a subcluster. Data flows from one subcluster to the next, until the output of the system. All instances of a subcluster run the same subquery, called the local subquery, on a fraction of the input data stream and produce a fraction of the output data stream. Communication between subclusters guarantees semantic transparency. Input queries are written in the Borealis query language and automatically parallelized by SC through its query compiler, which allows easy deployment of arbitrarily complex queries over large clusters with just a few steps. The