count: table sum of int;
total: table sum of float;
sumOfSquares: table sum of float;
x: float = input;
emit count <- 1;
emit total <- x;
emit sumOfSquares <- x * x;
Fig. 9.10
An example Sawzall program
introduces an aggregator type, called a table in Sawzall even though it may be a
singleton. These particular tables are sum tables, which add up the values
emitted to them, as ints or floats as appropriate. The Sawzall language is implemented
as a conventional compiler, written in C++, whose target language is an interpreted
instruction set, or byte-code. The compiler and the byte-code interpreter are part
of the same binary, so the user presents source code to Sawzall and the system
executes it directly. It is structured as a library with an external interface that accepts
source code which is then compiled and executed, along with bindings to connect to
externally-provided aggregators. The datasets of Sawzall programs are often stored
in the Google File System (GFS) [137]. The business of scheduling a job to run on a
cluster of machines is handled by software called Workqueue, which creates a large-
scale time-sharing system out of an array of computers and their disks. It schedules
jobs, allocates resources, reports status, and collects the results.
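The aggregation semantics of the sum tables in Fig. 9.10 can be illustrated with a small sketch. This is not Sawzall or its runtime; the class `SumTables`, the method `emitRecord`, and the sample input stream are all hypothetical, showing only what the three tables accumulate as each record's emit statements fire:

```java
import java.util.List;

public class SumTables {
    // Aggregate state; the fields mirror the three Sawzall tables.
    long count = 0;            // count: table sum of int
    double total = 0.0;        // total: table sum of float
    double sumOfSquares = 0.0; // sumOfSquares: table sum of float

    // Process one record, as the Sawzall program body does for each
    // input; each "emit t <- v" becomes an in-place addition here.
    void emitRecord(double x) {
        count += 1;            // emit count <- 1;
        total += x;            // emit total <- x;
        sumOfSquares += x * x; // emit sumOfSquares <- x * x;
    }

    public static void main(String[] args) {
        SumTables tables = new SumTables();
        // Hypothetical input stream; each value plays the role of one
        // record's "x: float = input".
        for (double x : List.of(1.0, 2.0, 3.0)) {
            tables.emitRecord(x);
        }
        System.out.println(tables.count + " "
            + tables.total + " " + tables.sumOfSquares);
        // prints "3 6.0 14.0"
    }
}
```

From these three running sums a downstream consumer can derive the mean and standard deviation of the whole dataset without ever materializing the individual records, which is why sum tables compose well with distributed execution.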
Google has also developed FlumeJava [97], a Java library for developing and
running data-parallel pipelines on top of MapReduce. FlumeJava is centered around
a few classes that represent parallel collections. Parallel collections support a
modest number of parallel operations, which are composed to implement data-
parallel computations; an entire pipeline, or even multiple pipelines, can
be expressed in a single Java program using the FlumeJava abstractions. To
achieve good performance, FlumeJava internally implements parallel operations
using deferred evaluation. The invocation of a parallel operation does not actually
run the operation, but instead simply records the operation and its arguments in
an internal execution plan graph structure. Once the execution plan for the whole
computation has been constructed, FlumeJava optimizes the execution plan and then
runs the optimized execution plan. When running the execution plan, FlumeJava
chooses which strategy to use to implement each operation (e.g., local sequential
loop vs. remote parallel MapReduce) based in part on the size of the data being
processed, places remote computations near the data on which they operate, and
performs independent operations in parallel.
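The deferred-evaluation pattern described above can be sketched in a few lines of Java. This is a minimal illustration, not FlumeJava's real API: the class `DeferredPipeline` and its `map` and `run` methods are hypothetical names. The key point is that invoking an operation only records it in a plan; nothing executes until `run()` is called, at which point the whole plan is visible and could be optimized before execution:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

public class DeferredPipeline {
    private final List<Integer> input;
    // The execution plan: operations are recorded here, not run.
    private final List<UnaryOperator<Integer>> plan = new ArrayList<>();

    DeferredPipeline(List<Integer> input) {
        this.input = input;
    }

    // Deferred operation: append to the plan and return immediately.
    DeferredPipeline map(UnaryOperator<Integer> op) {
        plan.add(op);
        return this;
    }

    // Only here does work happen. A real system would first optimize
    // the plan (e.g., fuse the two maps into a single pass) and choose
    // a strategy, such as a local loop vs. a remote MapReduce, based
    // on data size; this sketch just applies the operations in order.
    List<Integer> run() {
        List<Integer> out = new ArrayList<>();
        for (int v : input) {
            for (UnaryOperator<Integer> op : plan) {
                v = op.apply(v);
            }
            out.add(v);
        }
        return out;
    }

    public static void main(String[] args) {
        DeferredPipeline p = new DeferredPipeline(List.of(1, 2, 3))
            .map(x -> x + 1)   // recorded only, nothing runs yet
            .map(x -> x * x);  // recorded only, nothing runs yet
        System.out.println(p.run()); // executes the whole plan
        // prints "[4, 9, 16]"
    }
}
```

Because the full plan graph exists before any data moves, the optimizer can see across operation boundaries, which is what lets a system like FlumeJava translate many logical operations into far fewer MapReduce stages.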
 