Fig. 9.11
An example SQL query and its equivalent Pig Latin program
Pig Latin
Olston et al. [ 188 ] have presented a language called Pig Latin that takes a middle
position between expressing tasks using the high-level declarative querying model
in the spirit of SQL and the low-level/procedural programming model using
MapReduce. Pig Latin is implemented in the scope of the Apache Pig project [ 12 ]
and is used by programmers at Yahoo! for developing data analysis tasks. Writing a
Pig Latin program is similar to specifying a query execution plan (e.g. a data flow
graph). To experienced programmers, this method is more appealing than encoding
their task as an SQL query and then coercing the system to choose the desired
plan through optimizer hints. In general, automatic query optimization has its limits,
especially with uncataloged data, prevalent user-defined functions and parallel
execution, which are all features of the data analysis tasks targeted by the MapReduce
framework. Figure 9.11 shows an example SQL query and its equivalent Pig Latin
program. Given a URL table with the structure (url, category, pagerank), the task
of the SQL query is to find, for each sufficiently large category, the average pagerank
of its high-pagerank URLs (pagerank > 0.2). A Pig Latin program is described as a sequence of steps where
each step represents a single data transformation. This characteristic is appealing to
many programmers. At the same time, the transformation steps are described using
high-level primitives (e.g. filtering, grouping, aggregation) much like in SQL.
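The step-by-step style described above can be sketched in plain Python. This is not Pig Latin and does not reproduce Fig. 9.11; it only mimics a filter, group, filter, aggregate pipeline over a (url, category, pagerank) table. The sample rows and the MIN_GROUP_SIZE cutoff standing in for "large category" are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows of the URL table: (url, category, pagerank).
urls = [
    ("a.com", "news", 0.9),
    ("b.com", "news", 0.5),
    ("c.com", "news", 0.1),
    ("d.com", "sports", 0.8),
    ("e.com", "blogs", 0.25),
    ("f.com", "blogs", 0.75),
]

MIN_PAGERANK = 0.2   # threshold stated in the text
MIN_GROUP_SIZE = 2   # assumed stand-in for "large category"

# Step 1 (filter): keep only high-pagerank URLs.
good_urls = [row for row in urls if row[2] > MIN_PAGERANK]

# Step 2 (group): collect the surviving pageranks by category.
groups = defaultdict(list)
for url, category, pagerank in good_urls:
    groups[category].append(pagerank)

# Step 3 (filter groups): keep only sufficiently large categories.
big_groups = {c: ranks for c, ranks in groups.items()
              if len(ranks) >= MIN_GROUP_SIZE}

# Step 4 (aggregate): average pagerank per remaining category.
result = {c: sum(ranks) / len(ranks) for c, ranks in big_groups.items()}

print(result)
```

Each intermediate name (good_urls, groups, big_groups, result) holds the output of exactly one transformation, which is the property the text attributes to Pig Latin programs. A declarative SQL query would instead state the same conditions in a single statement and leave the order of evaluation to the optimizer.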
Pig Latin has several other features that are important for casual ad-hoc data
analysis tasks. These features include support for a flexible, fully nested data model,
extensive support for user-defined functions and the ability to operate over plain
input files without any schema information [ 136 ]. In particular, Pig Latin has a
simple data model consisting of the following four types:
1. Atom : An atom contains a simple atomic value such as a string or a number, e.g.
“alice”.
2. Tuple : A tuple is a sequence of fields, each of which can be any of the data types,
e.g. (“alice”, “lakers”).
3. Bag : A bag is a collection of tuples with possible duplicates. The schema of the
constituent tuples is flexible: not all tuples in a bag need to have the same
number and type of fields, e.g. {(“alice”, “lakers”), (“alice”, (“iPod”, “apple”))}.
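The types listed above can be mirrored with ordinary Python values to make the nesting concrete. This is a rough analogy only: representing a tuple as a Python tuple and a bag as a Python list is an assumption for illustration, not a description of how Pig stores data.

```python
# Atom: a simple atomic value such as a string or a number.
atom = "alice"

# Tuple: a sequence of fields; a field may itself be any of the data types.
flat_tuple = ("alice", "lakers")
nested_tuple = ("alice", ("iPod", "apple"))  # second field is itself a tuple

# Bag: a collection of tuples. Duplicates are allowed, and the tuples
# do not have to share the same number or type of fields.
bag = [flat_tuple, nested_tuple, flat_tuple]

# The second field is a string in one tuple and a nested tuple in the other.
second_field_types = [type(t[1]).__name__ for t in bag]
print(second_field_types)
```

A list is used for the bag rather than a set because a set would silently discard the duplicate tuple, which a bag must keep.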