count: table sum of int;
total: table sum of float;
sum_of_squares: table sum of float;
x: float = input;
emit count <- 1;
emit total <- x;
emit sum_of_squares <- x * x;
FIGURE 2.10 An example Sawzall program. (From R. Pike et al., Scientific Programming, 13(4), 277-298, 2005.)
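To make the emit/aggregator semantics of Figure 2.10 concrete, the following plain Java sketch (an illustration only, not Sawzall or its runtime) accumulates the same three sum tables over a small in-memory list of records; from such aggregates, statistics like the mean and variance of the input can then be derived.

import java.util.List;

/** Plain Java rendering of what Figure 2.10 computes: each emit adds a value to a
 *  "sum" table, so after all records are processed the tables hold the record count,
 *  the sum of the values, and the sum of their squares. */
public class SumTablesDemo {
    public static void main(String[] args) {
        List<Double> records = List.of(1.0, 2.0, 4.0);   // stands in for the input records

        long count = 0;            // count: table sum of int
        double total = 0.0;        // total: table sum of float
        double sumOfSquares = 0.0; // sum_of_squares: table sum of float

        for (double x : records) {       // x: float = input;
            count += 1;                  // emit count <- 1;
            total += x;                  // emit total <- x;
            sumOfSquares += x * x;       // emit sum_of_squares <- x * x;
        }

        // From these three aggregates one can derive, for example, the mean and variance.
        double mean = total / count;
        double variance = sumOfSquares / count - mean * mean;
        System.out.printf("count=%d mean=%.3f variance=%.3f%n", count, mean, variance);
    }
}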
The Sawzall program is then compiled and executed, along with bindings to connect to externally provided aggregators. The data sets processed by Sawzall programs are often stored in the Google File System (GFS) [58]. The business of scheduling a job to run on a cluster of machines is handled by software called Workqueue, which creates a large-scale time-sharing system out of an array of computers and their disks. It schedules jobs, allocates resources, reports status, and collects the results.
Google has also developed FlumeJava [30], a Java library for developing and running data-parallel pipelines on top of MapReduce. FlumeJava is centered around a few classes that represent parallel collections. These collections support a modest number of parallel operations that are composed to implement data-parallel computations, so that an entire pipeline, or even multiple pipelines, can be expressed in a single Java program using the FlumeJava abstractions. To achieve good performance, FlumeJava internally implements parallel operations using deferred evaluation. Invoking a parallel operation does not actually run the operation; instead, it simply records the operation and its arguments in an internal execution plan graph. Once the execution plan for the whole computation has been constructed, FlumeJava optimizes the plan and then runs it. When running the plan, FlumeJava chooses a strategy for implementing each operation (e.g., a local sequential loop vs. a remote parallel MapReduce) based in part on the size of the data being processed, places remote computations near the data on which they operate, and performs independent operations in parallel.
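The deferred-evaluation idea can be illustrated with a small, self-contained Java sketch. The class and method names below (PlanNode, parallelDo, run) are hypothetical stand-ins, not the actual FlumeJava API; the point is only that invoking an operation records a node in an execution plan graph, and that optimization and execution happen only when the plan is finally run.

import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** One node of a deferred execution plan: it records an operation, not a result. */
class PlanNode {
    final String op;                   // e.g., "read" or "parallelDo"
    final PlanNode input;              // upstream node; null for a source
    final Function<Object, Object> fn; // user function to apply, if any
    final List<Object> sourceData;     // only set for source nodes

    private PlanNode(String op, PlanNode input, Function<Object, Object> fn, List<Object> sourceData) {
        this.op = op;
        this.input = input;
        this.fn = fn;
        this.sourceData = sourceData;
    }

    /** A source node over in-memory data (standing in for a read from GFS). */
    static PlanNode source(List<?> data) {
        return new PlanNode("read", null, null, new ArrayList<>(data));
    }

    /** "Invoking" a parallel operation only extends the plan graph; nothing runs yet. */
    PlanNode parallelDo(Function<Object, Object> f) {
        return new PlanNode("parallelDo", this, f, null);
    }

    /**
     * Execute the whole plan. A real system would first optimize the graph and then
     * pick a local or remote MapReduce strategy per operation based on data size;
     * here we simply evaluate everything locally and sequentially.
     */
    List<Object> run() {
        if (input == null) {
            return new ArrayList<>(sourceData);
        }
        List<Object> out = new ArrayList<>();
        for (Object x : input.run()) {
            out.add(fn.apply(x));
        }
        return out;
    }
}

public class DeferredDemo {
    public static void main(String[] args) {
        PlanNode numbers = PlanNode.source(List.of(1, 2, 3, 4));
        // These calls only build the execution plan; no data is processed yet.
        PlanNode squared = numbers.parallelDo(x -> (Integer) x * (Integer) x);
        PlanNode labeled = squared.parallelDo(x -> "value=" + x);
        // Optimization and execution would happen only at this point.
        System.out.println(labeled.run());   // prints [value=1, value=4, value=9, value=16]
    }
}

In FlumeJava itself, the optimizer would, for example, fuse chains of such operations into fewer MapReduce stages before choosing a local or remote execution strategy for each one.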
2.4.2 Pig Latin
Olston et al. [109] have presented a language called Pig Latin that takes a middle position between expressing tasks using the high-level declarative querying model in the spirit of SQL and the low-level/procedural programming model using MapReduce.
Pig Latin is implemented in the scope of the Apache Pig project* and is used by
programmers at Yahoo! for developing data analysis tasks. Writing a Pig Latin pro-
gram is similar to specifying a query execution plan (e.g., a data flow graph). To
experienced programmers, this method is more appealing than encoding their task
as an SQL query and then coercing the system to choose the desired plan through
* http://incubator.apache.org/pig.