Programming with Pig - Hadoop in Action


002BB5A52580A8ED 18

005BD9CD3AC6BB38 18

00A08A54CD03EB95 3

011ACA65C2BF70B2 5

01500FAFE317B7C0 15

0158F8ACC570947D 3

018FBF6BFB213E68 1

Conceptually, we've performed an aggregation similar to the SQL query:

SELECT user, COUNT(*) FROM excite-small.log GROUP BY user;

Two main differences between the Pig Latin and SQL versions are worth pointing out. First, as we mentioned earlier, Pig Latin is a data processing language: you specify a series of data processing steps instead of a single complex SQL query with clauses. The other difference is more subtle: relations in SQL always have fixed schemas. In SQL, we define a relation's schema before it's populated with data. Pig takes a much looser approach to schema. In fact, you don't need to use schemas if you don't want to, which may be the case when handling semistructured or unstructured data. Here we do specify a schema for the relation log, but only in the load statement, and it isn't enforced until the data is loaded. Any field that doesn't obey the schema in the load operation is cast to null. In this way the relation log is guaranteed to obey our stated schema for subsequent operations.
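The load-time behavior described above can be pictured with a small model. The following is a minimal sketch in plain Python, not Pig code and not Pig's actual loader: the helper names cast_or_null and load_line, the tab delimiter, and the two sample rows are all assumptions made up for illustration. It only shows the idea that a field failing its declared type becomes null (None here) while the rest of the tuple is kept.

```python
# Plain-Python model of schema-on-load; this is NOT Pig code.
# SCHEMA mirrors the relation log: (user: chararray, time: long, query: chararray).
SCHEMA = [("user", str), ("time", int), ("query", str)]


def cast_or_null(value, caster):
    """Return caster(value), or None when the value does not fit the type."""
    if value is None:
        return None
    try:
        return caster(value)
    except (TypeError, ValueError):
        return None


def load_line(line, schema=SCHEMA, delimiter="\t"):
    """Split one delimited line and cast each field according to the schema."""
    raw = line.rstrip("\n").split(delimiter)
    return tuple(cast_or_null(v, caster) for v, (_, caster) in zip(raw, schema))


# Hypothetical sample rows, invented for illustration only.
rows = [
    "u1\t970916105432\tyahoo chat",
    "u2\tnot-a-number\tweather",
]
loaded = [load_line(r) for r in rows]

for record in loaded:
    print(record)
```

In this sketch the second row keeps its user and query fields, and only the time value that cannot be read as an integer is replaced with None, so every tuple downstream still matches the declared types.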

As much as possible, Pig tries to figure out the schema for a relation based on the operation used to create it. You can expose Pig's schema for any relation with the DESCRIBE command. This can be useful in understanding what a Pig statement is doing. For example, we'll look at the schemas for grpd and cntd. Before doing this, let's first see how the DESCRIBE command describes log.

grunt> DESCRIBE log;

log: {user: chararray,time: long,query: chararray}

As expected, the load command gives log the exact schema we've specified. The relation log consists of three fields named user, time, and query. The fields user and query are both strings (chararray in Pig) whereas time is a long integer.

A GROUP BY operation on the relation log generates the relation grpd. Based on the operation and the schema for log, Pig infers a schema for grpd:

grunt> DESCRIBE grpd;

grpd: {group: chararray,log: {user: chararray,time: long,query: chararray}}

group and log are two fields in grpd. The field log is a bag with subfields user, time, and query. Because we haven't yet covered Pig's type system and the GROUP BY operation, we don't expect you to understand this schema yet. The point is that relations in Pig can have fairly complex schemas, and DESCRIBE is your friend in understanding the relations you're working with:

grunt> DESCRIBE cntd;

cntd: {group: chararray,long}
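One way to read the grpd and cntd schemas is as nested data structures. The sketch below models them in plain Python under stated assumptions: it is not Pig code, the three sample tuples are invented for illustration, and the sorting step exists only to make this sketch's output deterministic. A bag is modeled as a Python list.

```python
from collections import defaultdict

# Hypothetical (user, time, query) tuples standing in for the relation log.
log = [
    ("u1", 1, "a"),
    ("u2", 2, "b"),
    ("u1", 3, "c"),
]

# grpd: one tuple per distinct user, shaped (group, bag of matching log tuples).
bags = defaultdict(list)
for record in log:
    bags[record[0]].append(record)
grpd = sorted(bags.items())  # sorted only so the sketch prints in a fixed order

# cntd: one tuple per group, shaped (group, number of tuples in the bag).
cntd = [(group, len(bag)) for group, bag in grpd]

for row in grpd:
    print(row)
for row in cntd:
    print(row)
```

Each element of grpd pairs a group key with the full log tuples that share it, which matches the shape {group, log: {user, time, query}}. Each element of cntd pairs the key with a single number, which matches {group, long}.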
