Big Data Processing Systems - Cloud Data Management

Database Reference

In-Depth Information

9.4

Systems of Declarative Interfaces for the MapReduce

Framework

For programmers, a key appealing feature in the MapReduce framework is that there

are only two main high-level declarative primitives ( map and reduce ) that can be

written in any programming language of choice and without worrying about the

details of their parallel execution. However, the MapReduce programming model

has its own limitations such as:

Its one-input data format (key/value pairs) and two-stage data flow is extremely

rigid. As we have previously discussed, to perform tasks that have a different data

flow (e.g. joins or n stages) would require inelegant workarounds.

Custom code has to be written for even the most common operations (e.g.

projection and filtering) which leads to the fact that the code is usually difficult

to reuse and maintain unless the users build and maintain their own libraries with

the common functions they use for processing their data.

Moreover, many programmers could be unfamiliar with the MapReduce frame-

work and they would prefer to use SQL (in which they are more proficient) as a high

level declarative language to express their task while leaving all of the execution

optimization details to the backend engine. In addition, it is beyond doubt that

high level language abstractions enable the underlying system to perform automatic

optimization. In the following subsection we discuss research efforts that have

been proposed to tackle these problems and add SQL-like interfaces on top of the

MapReduce framework.

Sawzall

Sawzall [ 195 ] is a scripting language used at Google on top of MapReduce. A

Sawzall program defines the operations to be performed on a single record of

the data. There is nothing in the language to enable examining multiple input

records simultaneously, or even to have the contents of one input record influence

the processing of another. The only output primitive in the language is the

emit statement, which sends data to an external aggregator (e.g. Sum, Average,

Maximum, Minimum) that gathers the results from each record after which the

results are then correlated and processed. The authors argue that aggregation is done

outside the language for a couple of reasons: (1) A more traditional language can

use the language to correlate results but some of the aggregation algorithms are

sophisticated and are best implemented in a native language and packaged in some

form. (2) Drawing an explicit line between filtering and aggregation enables a high

degree of parallelism and hides the parallelism from the language itself.

Figure 9.10 depicts an example Sawzall program where the first three lines

declare the aggregators count , total and sum of squares . The keyword table

Search WWH ::

Custom Search

Home