Programming with Pig - Hadoop in Action

Pig has two major components:

1. A high-level data processing language called Pig Latin.

2. A compiler that compiles and runs your Pig Latin script in a choice of evaluation mechanisms. The main evaluation mechanism is Hadoop. Pig also supports a local mode for development purposes.

Pig simplifies programming because Pig Latin makes your code easy to express. The compiler automatically exploits optimization opportunities in your script, which frees you from having to tune your program manually. As the Pig compiler improves, your Pig Latin programs will also get an automatic speed-up.

10.1 Thinking like a Pig

Pig has a certain philosophy about its design. We expect ease of use, high performance, and massive scalability from any Hadoop subproject. More distinctive, and crucial to understanding Pig, are the design choices of its programming language (a data flow language called Pig Latin), the data types it supports, and its treatment of user-defined functions (UDFs) as first-class citizens.

10.1.1 Data flow language

You write Pig Latin programs in a sequence of steps where each step is a single high-level data transformation. The transformations support relational-style operations, such as filter, union, group, and join. An example Pig Latin program that processes a search query log may look like

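-- Load the query log and name its three fields.
log = LOAD 'excite-small.log' AS (user, time, query);
-- Collect each user's records into one group.
grpd = GROUP log BY user;
-- Count the records in each group.
cntd = FOREACH grpd GENERATE group, COUNT(log);
-- Print the (user, count) pairs.
DUMP cntd;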

Even though the operations are relational in style, Pig Latin remains a data flow language. A data flow language is friendlier to programmers who think in terms of algorithms, which are more naturally expressed as data and control flows. On the other hand, a declarative language such as SQL is sometimes easier for analysts who prefer to just state the results they expect from a program. Hive is a different Hadoop subproject that targets users who prefer the SQL model. We'll learn about Hive in detail in chapter 11.
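
To make the step-by-step character of a data flow concrete, the query log example above could be extended with a few more transformations. This is only a sketch: the aliases `nonempty`, `srtd`, and `top10`, the field name `n`, and the choice of ten results are illustrative assumptions, not part of the original example.

```pig
-- Sketch only: assumes the same tab-delimited 'excite-small.log' as above.
log = LOAD 'excite-small.log' AS (user, time, query);

-- Step 1: drop records that have no query field.
nonempty = FILTER log BY query IS NOT NULL;

-- Step 2: collect the remaining records per user.
grpd = GROUP nonempty BY user;

-- Step 3: reduce each group to a (user, count) pair.
cntd = FOREACH grpd GENERATE group AS user, COUNT(nonempty) AS n;

-- Step 4: sort by count, largest first, and keep ten users.
srtd = ORDER cntd BY n DESC;
top10 = LIMIT srtd 10;

DUMP top10;
```

Each statement names the output of one transformation and feeds it to the next, so the script reads in the order the data moves. A SQL query for the same result would instead state the desired output in a single declarative statement.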

10.1.2 Data types

We can summarize Pig's philosophy toward data types in its slogan of “Pigs eat anything.” Input data can come in any format. Popular formats, such as tab-delimited text files, are natively supported. Users can add functions to support other data file formats as well. Pig doesn't require metadata or schema on data, but it can take advantage of them if they're provided.
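
The optional nature of schemas can be sketched by loading the same file three ways. The aliases (`raw`, `named`, `typed`) and the chosen field types are assumptions made for illustration; in particular, treating `time` as a `long` assumes every value in that column is a plain integer.

```pig
-- No schema: fields have no names and are addressed by position.
raw = LOAD 'excite-small.log';
users = FOREACH raw GENERATE $0;   -- $0 is the first field

-- Field names only: types are left unspecified.
named = LOAD 'excite-small.log' AS (user, time, query);

-- Names and types, with the tab-delimited loader spelled out.
-- PigStorage is Pig's default loader, and tab is its default delimiter.
typed = LOAD 'excite-small.log' USING PigStorage('\t')
        AS (user:chararray, time:long, query:chararray);
```

All three statements read the same data. The more schema information you provide, the more Pig can check and the more readable later steps become.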

Pig can operate on data that is relational, nested, semistructured, or unstructured. To support this diversity of data, Pig supports complex data types, such as bags and tuples that can be nested to form fairly sophisticated data structures.
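
Grouping is one common way such nested structures arise. The following sketch reuses the query log from section 10.1.1; the alias `queries` is an illustrative assumption.

```pig
log = LOAD 'excite-small.log' AS (user, time, query);
grpd = GROUP log BY user;

-- Print the schema of grpd. Each record is a tuple with two fields:
-- 'group' (the user) and 'log' (a bag holding that user's tuples).
DESCRIBE grpd;

-- Project one field out of the nested bag: for each user, this
-- yields a bag of single-field tuples containing the queries.
queries = FOREACH grpd GENERATE group, log.query;
```

Here a tuple contains a bag, which in turn contains tuples, which is the kind of nesting the text describes.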
