Big Data Processing Systems - Cloud Data Management - page 160

Fig. 9.11 An example SQL query and its equivalent Pig Latin program

Pig Latin

Olston et al. [188] have presented a language called Pig Latin that takes a
middle position between the high-level declarative querying model in the spirit
of SQL and the low-level, procedural programming model of MapReduce. Pig Latin
is implemented in the scope of the Apache Pig project [12] and is used by
programmers at Yahoo! for developing data analysis tasks. Writing a Pig Latin
program is similar to specifying a query execution plan (e.g. a data flow
graph). To experienced programmers, this method is more appealing than encoding
their task as an SQL query and then coercing the system to choose the desired
plan through optimizer hints. In general, automatic query optimization has its
limits, especially with uncataloged data, prevalent user-defined functions and
parallel execution, which are all features of the data analysis tasks targeted
by the MapReduce framework. Figure 9.11 shows an example SQL query and its
equivalent Pig Latin program. Given a URL table with the structure
(url, category, pagerank), the SQL query finds, for each sufficiently large
category, the average pagerank of its high-pagerank urls (pagerank > 0.2). A Pig
Latin program is described as a sequence of steps where each step represents a
single data transformation. This characteristic is appealing to many
programmers. At the same time, the transformation steps are described using
high-level primitives (e.g. filtering, grouping, aggregation), much like in SQL.
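The step-by-step style described above can be sketched in plain Python, one statement per transformation. This is only an illustration of the data flow (filter, group, filter, aggregate), not Pig Latin itself and not the program from Fig. 9.11. The sample rows, the function name and the `min_size` cutoff for a "large" category are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows of the form (url, category, pagerank).
urls = [
    ("a.example/1", "news", 0.75),
    ("a.example/2", "news", 0.5),
    ("b.example/1", "news", 0.1),
    ("c.example/1", "sports", 0.8),
    ("d.example/1", "blogs", 0.05),
]

def avg_pagerank_of_large_categories(rows, min_pagerank=0.2, min_size=2):
    # Step 1 (filter): keep only high-pagerank urls.
    good_urls = [row for row in rows if row[2] > min_pagerank]

    # Step 2 (group): collect the pageranks of the filtered rows by category.
    groups = defaultdict(list)
    for _url, category, pagerank in good_urls:
        groups[category].append(pagerank)

    # Step 3 (filter): keep only "large" groups; min_size is an assumed cutoff.
    big_groups = {c: prs for c, prs in groups.items() if len(prs) >= min_size}

    # Step 4 (aggregate): average pagerank per remaining category.
    return {c: mean(prs) for c, prs in big_groups.items()}

result = avg_pagerank_of_large_categories(urls)
print(result)
```

Each intermediate name (`good_urls`, `groups`, `big_groups`) holds the output of exactly one transformation, which is the property the text attributes to Pig Latin programs. An SQL formulation would instead state the filter, grouping and aggregation in a single declarative query and leave the order of evaluation to the optimizer.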

Pig Latin has several other features that are important for casual ad-hoc data
analysis tasks. These features include support for a flexible, fully nested data
model, extensive support for user-defined functions and the ability to operate
over plain input files without any schema information [136]. In particular, Pig
Latin has a simple data model consisting of the following four types:

1. Atom: An atom contains a simple atomic value such as a string or a number,
   e.g. “alice”.
2. Tuple: A tuple is a sequence of fields, each of which can be any of the data
   types, e.g. (“alice”, “lakers”).
3. Bag: A bag is a collection of tuples with possible duplicates. The schema of
   the constituent tuples is flexible: not all tuples in a bag need to have the
   same number and type of fields, e.g.
   {(“alice”, “lakers”), (“alice”, (“iPod”, “apple”))}.
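As a rough analogy, the three types listed so far can be mimicked with built-in Python values. The mapping is an assumption made for illustration (in particular, a Python list is ordered, whereas a bag is just a collection), and `field_types` is a hypothetical helper, not part of Pig.

```python
# Atom: a simple atomic value.
atom = "alice"

# Tuple: a sequence of fields; a field may itself be a tuple (nesting).
flat_tuple = ("alice", "lakers")
nested_tuple = ("alice", ("iPod", "apple"))

# Bag: a collection of tuples. Duplicates are allowed, and the tuples
# need not agree on the number or type of their fields.
bag = [flat_tuple, nested_tuple, flat_tuple]

def field_types(t):
    """Return the Python type name of each field of a tuple."""
    return tuple(type(field).__name__ for field in t)

# Inspect the shape of every tuple in the bag.
shapes = [field_types(t) for t in bag]
print(shapes)
```

The second tuple has a nested tuple as its second field while the others hold a plain string there, which is the kind of schema flexibility the bag definition allows.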
