9.4 Systems of Declarative Interfaces for the MapReduce Framework
For programmers, a key appeal of the MapReduce framework is that it exposes
only two high-level declarative primitives (map and reduce), which can be
written in any programming language of choice without worrying about the
details of their parallel execution. However, the MapReduce programming model
has its own limitations, such as:
• Its one-input data format (key/value pairs) and two-stage data flow are
  extremely rigid. As we have previously discussed, performing tasks that have
  a different data flow (e.g. joins or n stages) requires inelegant workarounds.
• Custom code has to be written for even the most common operations (e.g.
  projection and filtering). As a result, the code is usually difficult to
  reuse and maintain unless users build and maintain their own libraries of
  the common functions they use for processing their data.
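To make the second limitation concrete, the following is a minimal single-process sketch in Python, not code for any real MapReduce runtime. The names `run_mapreduce`, `map_fn`, `reduce_fn` and the sample records are hypothetical and introduced only for illustration. It shows that even a simple filter (age at least 30) plus a projection (keep only the city) has to be hand-coded inside the map function.

```python
from collections import defaultdict


def map_fn(key, record):
    """Hand-written filter and projection inside the map function."""
    if record["age"] >= 30:        # filtering: keep records with age >= 30
        yield record["city"], 1    # projection: keep only the city field


def reduce_fn(key, values):
    """Count the records that share the same key."""
    yield key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, record in records:
        for out_key, out_value in map_fn(key, record):
            groups[out_key].append(out_value)
    result = {}
    for key in sorted(groups):
        for out_key, out_value in reduce_fn(key, groups[key]):
            result[out_key] = out_value
    return result


# Illustrative input: (record id, record) pairs.
records = [
    (0, {"name": "a", "age": 34, "city": "Oslo"}),
    (1, {"name": "b", "age": 25, "city": "Oslo"}),
    (2, {"name": "c", "age": 41, "city": "Rome"}),
    (3, {"name": "d", "age": 30, "city": "Oslo"}),
]
counts = run_mapreduce(records, map_fn, reduce_fn)
```

In SQL the same task would be roughly one statement (`SELECT city, COUNT(*) FROM people WHERE age >= 30 GROUP BY city`), which is the kind of reuse a declarative layer aims to provide.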
Moreover, many programmers may be unfamiliar with the MapReduce framework
and would prefer to use SQL (in which they are more proficient) as a
high-level declarative language to express their tasks, leaving all of the
execution optimization details to the backend engine. In addition,
high-level language abstractions enable the underlying system to perform
automatic optimization. In the following subsection we discuss research efforts
that have been proposed to tackle these problems and add SQL-like interfaces on
top of the MapReduce framework.
Sawzall
Sawzall [195] is a scripting language used at Google on top of MapReduce. A
Sawzall program defines the operations to be performed on a single record of
the data. There is nothing in the language to enable examining multiple input
records simultaneously, or even to have the contents of one input record
influence the processing of another. The only output primitive in the language
is the emit statement, which sends data to an external aggregator (e.g. Sum,
Average, Maximum, Minimum) that gathers the results from each record; the
results are then correlated and processed. The authors argue that aggregation
is done outside the language for two reasons: (1) Although a more traditional
language would use the language itself to correlate results, some of the
aggregation algorithms are sophisticated and are best implemented in a native
language and packaged in some form. (2) Drawing an explicit line between
filtering and aggregation enables a high degree of parallelism and hides the
parallelism from the language itself.
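The per-record model described above can be mimicked with a short Python sketch. This is an analogue under stated assumptions, not Sawzall syntax and not Google's implementation: the class `SumTable`, the functions `process_record` and `run`, and the sample input are all hypothetical names chosen for illustration. The per-record function sees exactly one record and can only emit values to external sum aggregators, here named count, total and sum_of_squares.

```python
class SumTable:
    """Hypothetical stand-in for an external 'sum' aggregator."""

    def __init__(self):
        self.value = 0

    def emit(self, x):
        # The aggregator, not the per-record code, combines the results.
        self.value += x


def process_record(raw, tables):
    """Per-record logic: sees one record and can only emit to aggregators."""
    x = float(raw)
    tables["count"].emit(1)
    tables["total"].emit(x)
    tables["sum_of_squares"].emit(x * x)


def run(records):
    """Hypothetical sequential driver; a real system would run the
    per-record function in parallel across many machines."""
    tables = {name: SumTable() for name in ("count", "total", "sum_of_squares")}
    for raw in records:
        process_record(raw, tables)
    return {name: table.value for name, table in tables.items()}


stats = run(["1.0", "2.0", "3.0"])
```

Because `process_record` keeps no state between records, the calls could be distributed in any order, which is the property the explicit split between filtering and aggregation is meant to secure.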
Figure 9.10 depicts an example Sawzall program where the first three lines
declare the aggregators count, total and sum of squares. The keyword table