count: table sum of int;
total: table sum of float;
sum_of_squares: table sum of float;
x: float = input;
emit count <- 1;
emit total <- x;
emit sum_of_squares <- x * x;
FIGURE 2.10 An example Sawzall program. (From R. Pike et al., Scientific Programming, 13(4), 277-298, 2005.)
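To make the emit/aggregator semantics of Figure 2.10 concrete, the following plain Java sketch (an illustration only, not Sawzall or its runtime) accumulates the same three sum tables over a small in-memory list of records; from such aggregates, statistics like the mean and variance of the input can then be derived.

import java.util.List;

/** Plain Java rendering of what Figure 2.10 computes: each emit adds a value to a
 *  "sum" table, so after all records are processed the tables hold the record count,
 *  the sum of the values, and the sum of their squares. */
public class SumTablesDemo {
    public static void main(String[] args) {
        List<Double> records = List.of(1.0, 2.0, 4.0);   // stands in for the input records

        long count = 0;            // count: table sum of int
        double total = 0.0;        // total: table sum of float
        double sumOfSquares = 0.0; // sum_of_squares: table sum of float

        for (double x : records) {       // x: float = input;
            count += 1;                  // emit count <- 1;
            total += x;                  // emit total <- x;
            sumOfSquares += x * x;       // emit sum_of_squares <- x * x;
        }

        // From these three aggregates one can derive, for example, the mean and variance.
        double mean = total / count;
        double variance = sumOfSquares / count - mean * mean;
        System.out.printf("count=%d mean=%.3f variance=%.3f%n", count, mean, variance);
    }
}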
The Sawzall program is then compiled and executed, along with bindings to connect to externally provided aggregators. The data sets processed by Sawzall programs are often stored in the Google File System (GFS) [58]. The business of scheduling a job to run on a cluster of machines is handled by software called Workqueue, which creates a large-scale time-sharing system out of an array of computers and their disks. It schedules jobs, allocates resources, reports status, and collects the results.
Google has also developed FlumeJava [30], a Java library for developing and running data-parallel pipelines on top of MapReduce. FlumeJava is centered around a few classes that represent parallel collections. These collections support a modest number of parallel operations that are composed to implement data-parallel computations, so that an entire pipeline, or even multiple pipelines, can be expressed in a single Java program using the FlumeJava abstractions. To achieve good performance, FlumeJava internally implements parallel operations using deferred evaluation. Invoking a parallel operation does not actually run the operation; instead, it simply records the operation and its arguments in an internal execution plan graph. Once the execution plan for the whole computation has been constructed, FlumeJava optimizes the plan and then runs it. When running the plan, FlumeJava chooses a strategy for implementing each operation (e.g., a local sequential loop vs. a remote parallel MapReduce) based in part on the size of the data being processed, places remote computations near the data on which they operate, and performs independent operations in parallel.
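The deferred-evaluation idea can be illustrated with a small, self-contained Java sketch. The class and method names below (PlanNode, parallelDo, run) are hypothetical stand-ins, not the actual FlumeJava API; the point is only that invoking an operation records a node in an execution plan graph, and that optimization and execution happen only when the plan is finally run.

import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** One node of a deferred execution plan: it records an operation, not a result. */
class PlanNode {
    final String op;                   // e.g., "read" or "parallelDo"
    final PlanNode input;              // upstream node; null for a source
    final Function<Object, Object> fn; // user function to apply, if any
    final List<Object> sourceData;     // only set for source nodes

    private PlanNode(String op, PlanNode input, Function<Object, Object> fn, List<Object> sourceData) {
        this.op = op;
        this.input = input;
        this.fn = fn;
        this.sourceData = sourceData;
    }

    /** A source node over in-memory data (standing in for a read from GFS). */
    static PlanNode source(List<?> data) {
        return new PlanNode("read", null, null, new ArrayList<>(data));
    }

    /** "Invoking" a parallel operation only extends the plan graph; nothing runs yet. */
    PlanNode parallelDo(Function<Object, Object> f) {
        return new PlanNode("parallelDo", this, f, null);
    }

    /**
     * Execute the whole plan. A real system would first optimize the graph and then
     * pick a local or remote MapReduce strategy per operation based on data size;
     * here we simply evaluate everything locally and sequentially.
     */
    List<Object> run() {
        if (input == null) {
            return new ArrayList<>(sourceData);
        }
        List<Object> out = new ArrayList<>();
        for (Object x : input.run()) {
            out.add(fn.apply(x));
        }
        return out;
    }
}

public class DeferredDemo {
    public static void main(String[] args) {
        PlanNode numbers = PlanNode.source(List.of(1, 2, 3, 4));
        // These calls only build the execution plan; no data is processed yet.
        PlanNode squared = numbers.parallelDo(x -> (Integer) x * (Integer) x);
        PlanNode labeled = squared.parallelDo(x -> "value=" + x);
        // Optimization and execution would happen only at this point.
        System.out.println(labeled.run());   // prints [value=1, value=4, value=9, value=16]
    }
}

In FlumeJava itself, the optimizer would, for example, fuse chains of such operations into fewer MapReduce stages before choosing a local or remote execution strategy for each one.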
2.4.2 Pig Latin
Olston et al. [109] have presented a language called Pig Latin that takes a middle position between expressing tasks using the high-level declarative querying model in the spirit of SQL and the low-level/procedural programming model using MapReduce.
Pig Latin is implemented in the scope of the Apache Pig project* and is used by
programmers at Yahoo! for developing data analysis tasks. Writing a Pig Latin pro-
gram is similar to specifying a query execution plan (e.g., a data flow graph). To
experienced programmers, this method is more appealing than encoding their task
as an SQL query and then coercing the system to choose the desired plan through
* http://incubator.apache.org/pig.