FIGURE 2.12 Pig compilation and execution steps: Parser, Logical optimizer, MapReduce compiler, MapReduce optimizer, Hadoop. (From C. Olston et al., Pig Latin: A not-so-foreign language for data processing, in SIGMOD, pp. 1099-1110, 2008.)
To accommodate specialized data-processing tasks, Pig Latin has extensive sup-
port for user-defined functions (UDFs). The input and output of UDFs in Pig Latin
follow its fully nested data model. Pig Latin is architected such that the parsing of the
Pig Latin program and the logical plan construction is independent of the execution
platform. Only the compilation of the logical plan into a physical plan depends on
the specific execution platform chosen. Currently, Pig Latin programs are compiled
into sequences of MapReduce jobs that are executed using the Hadoop MapReduce
environment. In particular, a Pig Latin program goes through a series of transformation steps [109] before being executed, as depicted in Figure 2.12. The parsing step verifies that the program is syntactically correct and that all referenced variables are
defined. The output of the parser is a canonical logical plan with a one-to-one cor-
respondence between Pig Latin statements and logical operators that are arranged in
a directed acyclic graph (DAG). The logical plan generated by the parser is passed
through a logical optimizer. In this stage, logical optimizations such as projection
pushdown are carried out. The optimized logical plan is then compiled into a series
of MapReduce jobs that are then passed through another optimization phase. The
DAG of optimized MapReduce jobs is then topologically sorted and jobs are submit-
ted to Hadoop for execution.
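The final step described above, ordering the DAG of optimized MapReduce jobs before submission, can be sketched with a standard topological sort. The following is a minimal illustration, not Pig's actual code; the job names and dependency edges are invented for the example.

```python
from collections import deque

def topological_order(jobs, deps):
    """Kahn's algorithm: deps maps each job to the set of jobs it depends on."""
    indegree = {j: len(deps.get(j, ())) for j in jobs}
    ready = deque(j for j in jobs if indegree[j] == 0)
    order = []
    while ready:
        j = ready.popleft()
        order.append(j)
        # A job becomes ready once all jobs it depends on have been ordered.
        for k in jobs:
            if j in deps.get(k, ()):
                indegree[k] -= 1
                if indegree[k] == 0:
                    ready.append(k)
    if len(order) != len(jobs):
        raise ValueError("job graph contains a cycle")
    return order

# Hypothetical job DAG: a load/filter job feeds a join, which feeds a group-by.
jobs = ["load_filter", "join", "group_aggregate"]
deps = {"join": {"load_filter"}, "group_aggregate": {"join"}}
submission_order = topological_order(jobs, deps)
```

Submitting jobs in this order guarantees that every job's inputs have been produced by an earlier job before it runs on Hadoop.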
2.4.3 Hive
The Hive project* is an open-source data warehousing solution that has been built
by the Facebook Data Infrastructure Team on top of the Hadoop environment [123].
The main goal of this project is to bring the familiar relational database concepts
(e.g., tables, columns, partitions) and a subset of SQL to the unstructured world of
Hadoop while still maintaining the extensibility and flexibility that Hadoop provides.
* http://hadoop.apache.org/hive/.
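The relational concepts Hive layers on Hadoop can be made concrete with a toy model; this is an illustrative assumption, not Hive's implementation. A "table" is a collection of rows, a "partition" is a subset of the data keyed by a column value, and a simple SQL-like query becomes a scan restricted to the matching partition.

```python
# Toy table partitioned by a date column "ds", a common Hive warehouse layout.
# All names and data here are invented for illustration.
table = {
    "2008-06-01": [{"user": "alice", "clicks": 3}, {"user": "bob", "clicks": 7}],
    "2008-06-02": [{"user": "alice", "clicks": 5}],
}

def select(table, columns, partition):
    """Rough analogue of: SELECT <columns> FROM table WHERE ds = <partition>."""
    # Partition pruning: only the requested partition's rows are scanned,
    # mirroring how partitioning limits the files a query must read.
    return [{c: row[c] for c in columns} for row in table.get(partition, [])]

result = select(table, ["user"], "2008-06-01")
```

The point of the sketch is the access path: because the data is laid out by partition, the query never touches rows outside the requested date.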