Pig has two major components:
1. A high-level data processing language called Pig Latin.
2. A compiler that compiles and runs your Pig Latin script in a choice of evaluation mechanisms. The main evaluation mechanism is Hadoop. Pig also supports a local mode for development purposes.
Pig simplifies programming because Pig Latin makes your code easy to express. The compiler automatically exploits optimization opportunities in your script, which frees you from having to tune your program manually. As the Pig compiler improves, your Pig Latin programs will also get an automatic speed-up.
10.1 Thinking like a Pig
Pig has a certain philosophy about its design. We expect ease of use, high performance, and massive scalability from any Hadoop subproject. More distinctive, and crucial to understanding Pig, are the design choices of its programming language (a data flow language called Pig Latin), the data types it supports, and its treatment of user-defined functions (UDFs) as first-class citizens.
10.1.1 Data flow language
You write Pig Latin programs as a sequence of steps, where each step is a single high-level data transformation. The transformations support relational-style operations, such as filter, union, group, and join. An example Pig Latin program that processes a search query log may look like this:
log = LOAD 'excite-small.log' AS (user, time, query);
grpd = GROUP log BY user;
cntd = FOREACH grpd GENERATE group, COUNT(log);
DUMP cntd;
Even though the operations are relational in style, Pig Latin remains a data flow language. A data flow language is friendlier to programmers who think in terms of algorithms, which are more naturally expressed by data and control flows. On the other hand, a declarative language such as SQL is sometimes easier for analysts who prefer to just state the results they expect from a program. Hive is a different Hadoop subproject that targets users who prefer the SQL model. We'll learn about Hive in detail in chapter 11.
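To make the step-by-step nature of the script above concrete, here is a rough Python analogue of the same data flow. It is only a sketch: the three sample records are hypothetical (they are not taken from the actual log file), and plain Python lists and dictionaries stand in for Pig's relations.

```python
from collections import defaultdict

# Hypothetical sample records in (user, time, query) form.
# These values are made up for illustration; they are not from the real log.
log = [
    ("u1", "t1", "hadoop"),
    ("u2", "t2", "pig latin"),
    ("u1", "t3", "hive"),
]

# Step 1, like GROUP log BY user: collect each user's records together.
grpd = defaultdict(list)
for record in log:
    user = record[0]
    grpd[user].append(record)

# Step 2, like FOREACH grpd GENERATE group, COUNT(log):
# emit one (user, number_of_records) pair per group.
cntd = [(user, len(records)) for user, records in grpd.items()]

# Step 3, like DUMP cntd: print the result.
for row in cntd:
    print(row)
```

Each statement takes the output of the previous one and applies a single transformation, which is the sense in which the program describes a flow of data rather than only a desired result.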
10.1.2 Data types
We can summarize Pig's philosophy toward data types in its slogan of “Pigs eat anything.” Input data can come in any format. Popular formats, such as tab-delimited text files, are natively supported. Users can add functions to support other data file formats as well. Pig doesn't require metadata or schema on data, but it can take advantage of them if they're provided.
Pig can operate on data that is relational, nested, semistructured, or unstructured. To support this diversity of data, Pig supports complex data types, such as bags and tuples, that can be nested to form fairly sophisticated data structures.
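As a loose illustration of how such nesting works, the sketch below models a tuple as a Python tuple and a bag as a Python list of tuples. This is only an analogy with hypothetical values: a Python list is ordered, whereas a Pig bag is an unordered collection of tuples.

```python
# A tuple is an ordered set of fields (hypothetical values).
record_a = ("u1", "hadoop")
record_b = ("u1", "hive")

# A bag is a collection of tuples; a list stands in for it here.
bag = [record_a, record_b]

# A tuple whose second field is itself a bag. A grouping step yields
# this kind of nested shape: (group key, bag of matching records).
grouped = ("u1", bag)

# Unpack the nested structure and pull one field out of every inner tuple.
key, records = grouped
queries = [query for _, query in records]
print(key, queries)
```

Because a field of a tuple can itself be a bag, and a bag holds tuples, the two types can be combined to any depth.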
 