count: table sum of int;
total: table sum of float;
sumOfSquares: table sum of float;
x: float = input;
emit count <- 1;
emit total <- x;
emit sumOfSquares <- x * x;
Fig. 9.10
An example Sawzall program
introduces an aggregator type, called a table in Sawzall even though it may be a
singleton. These particular tables are sum tables, which add up the values
emitted to them, as ints or floats as appropriate. The Sawzall language is implemented
as a conventional compiler, written in C++, whose target language is an interpreted
instruction set, or byte-code. The compiler and the byte-code interpreter are part
of the same binary, so the user presents source code to Sawzall and the system
executes it directly. It is structured as a library with an external interface that accepts
source code which is then compiled and executed, along with bindings to connect to
externally-provided aggregators. The datasets of Sawzall programs are often stored
in the Google File System (GFS) [137]. The business of scheduling a job to run on a
cluster of machines is handled by software called Workqueue, which creates a large-
scale time-sharing system out of an array of computers and their disks. It schedules
jobs, allocates resources, reports status, and collects the results.
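The aggregation semantics of the sum tables in Fig. 9.10 can be illustrated with a small sketch. This is not Sawzall or its runtime; the class `SumTables`, the method `emitRecord`, and the sample input stream are all hypothetical, showing only what the three tables accumulate as each record's emit statements fire:

```java
import java.util.List;

public class SumTables {
    // Aggregate state; the fields mirror the three Sawzall tables.
    long count = 0;            // count: table sum of int
    double total = 0.0;        // total: table sum of float
    double sumOfSquares = 0.0; // sumOfSquares: table sum of float

    // Process one record, as the Sawzall program body does for each
    // input; each "emit t <- v" becomes an in-place addition here.
    void emitRecord(double x) {
        count += 1;            // emit count <- 1;
        total += x;            // emit total <- x;
        sumOfSquares += x * x; // emit sumOfSquares <- x * x;
    }

    public static void main(String[] args) {
        SumTables tables = new SumTables();
        // Hypothetical input stream; each value plays the role of one
        // record's "x: float = input".
        for (double x : List.of(1.0, 2.0, 3.0)) {
            tables.emitRecord(x);
        }
        System.out.println(tables.count + " "
            + tables.total + " " + tables.sumOfSquares);
        // prints "3 6.0 14.0"
    }
}
```

From these three running sums a downstream consumer can derive the mean and standard deviation of the whole dataset without ever materializing the individual records, which is why sum tables compose well with distributed execution.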
Google has also developed FlumeJava [97], a Java library for developing and
running data-parallel pipelines on top of MapReduce. FlumeJava is centered around
a few classes that represent parallel collections. Parallel collections support a
modest number of parallel operations, which are composed to implement data-
parallel computations; an entire pipeline, or even multiple pipelines, can
be expressed in a single Java program using the FlumeJava abstractions. To
achieve good performance, FlumeJava internally implements parallel operations
using deferred evaluation. The invocation of a parallel operation does not actually
run the operation, but instead simply records the operation and its arguments in
an internal execution plan graph structure. Once the execution plan for the whole
computation has been constructed, FlumeJava optimizes the execution plan and then
runs the optimized execution plan. When running the execution plan, FlumeJava
chooses which strategy to use to implement each operation (e.g., local sequential
loop vs. remote parallel MapReduce) based in part on the size of the data being
processed, places remote computations near the data on which they operate, and
performs independent operations in parallel.
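The deferred-evaluation pattern described above can be sketched in a few lines of Java. This is a minimal illustration, not FlumeJava's real API: the class `DeferredPipeline` and its `map` and `run` methods are hypothetical names. The key point is that invoking an operation only records it in a plan; nothing executes until `run()` is called, at which point the whole plan is visible and could be optimized before execution:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

public class DeferredPipeline {
    private final List<Integer> input;
    // The execution plan: operations are recorded here, not run.
    private final List<UnaryOperator<Integer>> plan = new ArrayList<>();

    DeferredPipeline(List<Integer> input) {
        this.input = input;
    }

    // Deferred operation: append to the plan and return immediately.
    DeferredPipeline map(UnaryOperator<Integer> op) {
        plan.add(op);
        return this;
    }

    // Only here does work happen. A real system would first optimize
    // the plan (e.g., fuse the two maps into a single pass) and choose
    // a strategy, such as a local loop vs. a remote MapReduce, based
    // on data size; this sketch just applies the operations in order.
    List<Integer> run() {
        List<Integer> out = new ArrayList<>();
        for (int v : input) {
            for (UnaryOperator<Integer> op : plan) {
                v = op.apply(v);
            }
            out.add(v);
        }
        return out;
    }

    public static void main(String[] args) {
        DeferredPipeline p = new DeferredPipeline(List.of(1, 2, 3))
            .map(x -> x + 1)   // recorded only, nothing runs yet
            .map(x -> x * x);  // recorded only, nothing runs yet
        System.out.println(p.run()); // executes the whole plan
        // prints "[4, 9, 16]"
    }
}
```

Because the full plan graph exists before any data moves, the optimizer can see across operation boundaries, which is what lets a system like FlumeJava translate many logical operations into far fewer MapReduce stages.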
 