written in any programming language of choice and without worrying about the
details of their parallel execution. However, the MapReduce programming model
has its own limitations, such as:
• Its one-input data format (key/value pairs) and two-stage data flow are
extremely rigid. As we have previously discussed, performing tasks that
have a different data flow (e.g., joins or n stages) requires inelegant
workarounds.
• Custom code has to be written for even the most common operations (e.g.,
projection and filtering), so the code is usually difficult to reuse and
maintain unless users build and maintain their own libraries of the
common functions they use to process their data.
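To make the second limitation concrete, the following is a minimal Python sketch of a projection and a filter written by hand in the map/reduce style. It is not code for any real MapReduce framework: run_mapreduce is a hypothetical in-memory driver, and the record fields (name, age, city) are invented for illustration.

```python
from collections import defaultdict

def map_fn(record):
    # Filtering: keep only people aged 30 or more.
    # Projection: keep only the city (as key) and the name (as value).
    if record["age"] >= 30:
        yield record["city"], record["name"]

def reduce_fn(key, values):
    # Nothing to aggregate here; just emit the grouped names in sorted order.
    yield key, sorted(values)

def run_mapreduce(records, map_fn, reduce_fn):
    # Hypothetical single-process driver: map, group by key, then reduce.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    results = []
    for key in sorted(groups):
        results.extend(reduce_fn(key, groups[key]))
    return results

records = [
    {"name": "Ana", "age": 34, "city": "Lima"},
    {"name": "Bo", "age": 25, "city": "Oslo"},
    {"name": "Cy", "age": 41, "city": "Lima"},
    {"name": "Di", "age": 30, "city": "Oslo"},
]

# Roughly what a declarative query would express in one line:
#   SELECT city, name FROM people WHERE age >= 30
result = run_mapreduce(records, map_fn, reduce_fn)
```

Even in this toy setting, the filter condition and the projected fields are buried inside map_fn, so reusing them for another data set means copying or parameterizing the function by hand.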
Moreover, many programmers may be unfamiliar with the MapReduce framework
and would prefer to use SQL (in which they are more proficient) as a
high-level declarative language to express their tasks, leaving all of the
execution optimization details to the backend engine. In addition,
high-level language abstractions enable the underlying system to perform automatic
optimization. In the following subsections we discuss research efforts that have
been proposed to tackle these problems and add SQL-like interfaces on top of the
MapReduce framework.
2.4.1 Sawzall
Sawzall [114] is a scripting language used at Google on top of MapReduce. A
Sawzall program defines the operations to be performed on a single record of the
data. There is nothing in the language to enable examining multiple input records
simultaneously, or even to have the contents of one input record influence the
processing of another. The only output primitive in the language is the emit statement,
which sends data to an external aggregator (e.g., sum, average, maximum, minimum).
The aggregator gathers the results from each record, after which the results are
correlated and processed. The authors argue that aggregation is done outside the
language for a couple of reasons: (1) A more traditional language would use the
language itself to correlate results, but some of the aggregation algorithms are
sophisticated and are best implemented in a native language and packaged in some
form. (2) Drawing an explicit line between filtering and aggregation enables a high
degree of parallelism and hides the parallelism from the language itself.
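The split between per-record code and external aggregators can be sketched in Python as follows. This is an illustration of the model only, not Sawzall syntax or Google's implementation: SumTable, process_record, run_shard, and merge_shards are hypothetical names, and the three tables mirror the count, total, and sum of squares aggregators of the example program discussed in this section.

```python
class SumTable:
    """Hypothetical stand-in for an external sum aggregator."""

    def __init__(self):
        self.value = 0

    def emit(self, x):
        # The only way per-record code can produce output.
        self.value += x

    def merge(self, other):
        # Partial sums from different shards combine by addition.
        self.value += other.value

def new_tables():
    return {name: SumTable() for name in ("count", "total", "sum_of_squares")}

def process_record(x, tables):
    # Sees exactly one record and keeps no state between calls.
    tables["count"].emit(1)
    tables["total"].emit(x)
    tables["sum_of_squares"].emit(x * x)

def run_shard(records):
    # Each shard could run on a different machine; here it is a plain loop.
    tables = new_tables()
    for x in records:
        process_record(x, tables)
    return tables

def merge_shards(shard_tables):
    # Aggregation happens outside the per-record code.
    merged = new_tables()
    for tables in shard_tables:
        for name, table in tables.items():
            merged[name].merge(table)
    return merged

shards = [[1.0, 2.0], [3.0], [4.0, 5.0]]
merged = merge_shards(run_shard(shard) for shard in shards)

count = merged["count"].value
total = merged["total"].value
sum_of_squares = merged["sum_of_squares"].value
mean = total / count
variance = sum_of_squares / count - mean ** 2
```

Because process_record never looks at another record, the shards can be processed in any order or in parallel, and only the aggregators need to know how to combine partial results.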
Figure 2.10 depicts an example Sawzall program in which the first three lines declare
the aggregators count, total, and sum_of_squares. The keyword table introduces an
aggregator type; aggregators are called tables in Sawzall even though they may be
singletons. These particular tables are sum tables, which add up the values emitted
to them, ints or floats as appropriate. The Sawzall language is implemented as a
conventional compiler, written in C++, whose target language is an interpreted
instruction set, or byte-code. The compiler and the byte-code interpreter are part
of the same binary, so the user presents source code to Sawzall and the system
executes it directly. It is
structured as a library with an external interface that accepts source code, which