Distributed Programming for the Cloud - Large Scale and Big Data: Processing and Management


written in any programming language of choice and without worrying about the details of their parallel execution. However, the MapReduce programming model has its own limitations, such as:

• Its one-input data format (key/value pairs) and two-stage data flow are extremely rigid. As previously discussed, performing tasks that have a different data flow (e.g., joins or n stages) requires inelegant workarounds.

• Custom code has to be written for even the most common operations (e.g., projection and filtering). As a result, the code is usually difficult to reuse and maintain unless users build and maintain their own libraries of the common functions they use for processing their data.
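
To make the second limitation concrete, the following is a minimal plain-Python sketch (not Hadoop or Google MapReduce API code) of the kind of hand-written map function needed just to filter and project records. The record layout, the field names (name, age, city), the threshold, and the helper run_map_only are all hypothetical and chosen only for illustration. In SQL, the same task would be roughly one statement such as SELECT name, city FROM people WHERE age > 30.

```python
# Illustrative only: a plain-Python imitation of a map-only job.
# The field names and the run_map_only helper are assumptions, not a real API.

def filter_and_project_map(key, record):
    """Map function: keep records with age > 30 and emit only name and city."""
    if record["age"] > 30:  # filtering
        # projection: drop every field except name and city
        yield key, {"name": record["name"], "city": record["city"]}


def run_map_only(map_fn, pairs):
    """Apply map_fn to every (key, value) pair and collect the emitted pairs."""
    output = []
    for key, value in pairs:
        output.extend(map_fn(key, value))
    return output


people = [
    (1, {"name": "Ada", "age": 36, "city": "London"}),
    (2, {"name": "Bob", "age": 25, "city": "Paris"}),
    (3, {"name": "Chen", "age": 41, "city": "Beijing"}),
]

result = run_map_only(filter_and_project_map, people)
```

Every new predicate or column list would need another function of this shape, which is why such code tends to be duplicated rather than reused.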

Moreover, many programmers may be unfamiliar with the MapReduce framework and would prefer to use SQL (in which they are more proficient) as a high-level declarative language to express their tasks, leaving all of the execution optimization details to the backend engine. In addition, high-level language abstractions enable the underlying system to perform automatic optimization. In the following subsection we discuss research efforts that have been proposed to tackle these problems and add SQL-like interfaces on top of the MapReduce framework.

2.4.1 Sawzall

Sawzall [114] is a scripting language used at Google on top of MapReduce. A Sawzall program defines the operations to be performed on a single record of the data. There is nothing in the language to enable examining multiple input records simultaneously, or even to have the contents of one input record influence the processing of another. The only output primitive in the language is the emit statement, which sends data to an external aggregator (e.g., sum, average, maximum, minimum) that gathers the results from each record, after which the results are correlated and processed. The authors argue that aggregation is done outside the language for two reasons: (1) A more traditional language would use the language itself to correlate results, but some of the aggregation algorithms are sophisticated and are best implemented in a native language and packaged in some form. (2) Drawing an explicit line between filtering and aggregation enables a high degree of parallelism and hides the parallelism from the language itself.
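
The split between per-record logic and external aggregation can be sketched in plain Python. This is an emulation of the model under stated assumptions, not Sawzall code and not Google's implementation: the class SumTable, the functions process_record and run, and the table names are hypothetical names introduced here for illustration.

```python
# Illustrative emulation of the Sawzall processing model in plain Python.
# SumTable, process_record, and run are hypothetical names, not Sawzall APIs.

class SumTable:
    """External aggregator that adds up every value emitted to it."""

    def __init__(self, zero):
        self.value = zero

    def emit(self, x):
        self.value += x


def process_record(record, tables):
    """Per-record logic: sees exactly one record and can only emit values."""
    x = float(record)
    tables["count"].emit(1)
    tables["total"].emit(x)
    tables["sum_of_squares"].emit(x * x)


def run(records):
    """Driver standing in for the runtime: feeds records one at a time."""
    tables = {
        "count": SumTable(0),
        "total": SumTable(0.0),
        "sum_of_squares": SumTable(0.0),
    }
    for record in records:
        process_record(record, tables)
    return {name: table.value for name, table in tables.items()}


results = run(["1", "2", "3"])
```

Because process_record never reads another record or the tables' current values, the records could in principle be handled in any order and on many machines, with the partial sums merged afterwards. The sketch runs sequentially only to stay short.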

Figure 2.10 depicts an example Sawzall program where the first three lines declare the aggregators count, total, and sum_of_squares. The keyword table introduces an aggregator type; aggregators are called tables in Sawzall even though they may be singletons. These particular tables are sum tables that add up the values emitted to them, ints or floats as appropriate. The Sawzall language is implemented as a conventional compiler, written in C++, whose target language is an interpreted instruction set, or byte-code. The compiler and the byte-code interpreter are part of the same binary, so the user presents source code to Sawzall and the system executes it directly. It is structured as a library with an external interface that accepts source code, which
