002BB5A52580A8ED 18
005BD9CD3AC6BB38 18
00A08A54CD03EB95 3
011ACA65C2BF70B2 5
01500FAFE317B7C0 15
0158F8ACC570947D 3
018FBF6BFB213E68 1
Conceptually we've performed an aggregating operation similar to the SQL query:
SELECT user, COUNT(*) FROM excite-small.log GROUP BY user;
Two main differences between the Pig Latin and SQL versions are worth pointing out.
As mentioned earlier, Pig Latin is a data processing language: you specify a series
of data processing steps rather than a single complex SQL query with clauses. The
other difference is more subtle: relations in SQL always have fixed schemas. In SQL,
we define a relation's schema before it's populated with data. Pig takes a much looser
approach to schema. In fact, you don't need to use schemas at all if you don't want to,
which may be the case when handling semistructured or unstructured data. Here we
do specify a schema for the relation log, but only in the load statement, and it isn't
enforced until the data is actually loaded. Any field that doesn't conform to the schema
during the load is cast to null. In this way the relation log is guaranteed to obey our
stated schema for subsequent operations.
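The load-time behavior described above can be sketched in Python. This is an illustration only, not Pig's implementation: the SCHEMA table, the load_line function, and the tab-separated input format are all assumptions made for the example.

```python
# Hypothetical schema mirroring the three fields of the relation log.
# Each entry pairs a field name with the function used to cast raw text.
SCHEMA = [
    ("user", str),
    ("time", int),
    ("query", str),
]


def load_line(line):
    """Split one tab-separated line and cast each field per SCHEMA.

    A field that is missing or cannot be cast becomes None, which plays
    the role of a null here.
    """
    raw = line.rstrip("\n").split("\t")
    record = {}
    for index, (name, cast) in enumerate(SCHEMA):
        try:
            record[name] = cast(raw[index])
        except (ValueError, IndexError):
            # Bad or absent value: store a null instead of failing the load.
            record[name] = None
    return record
```

Because every record leaving load_line has exactly the fields named in SCHEMA, with each value either correctly typed or None, later steps can rely on the schema without rechecking it.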
As much as possible, Pig tries to figure out the schema for a relation based on
the operation used to create it. You can expose Pig's schema for any relation with
the DESCRIBE command. This can be useful in understanding what a Pig statement is
doing. For example, we'll look at the schemas for grpd and cntd. Before doing this,
let's first see how the DESCRIBE command describes log.
grunt> DESCRIBE log;
log: {user: chararray,time: long,query: chararray}
As expected, the load command gives log the exact schema we've specified. The
relation log consists of three fields named user, time, and query. The fields user and
query are both strings (chararray in Pig) whereas time is a long integer.
A GROUP BY operation on the relation log generates the relation grpd. Based on
the operation and the schema for log, Pig infers a schema for grpd:
grunt> DESCRIBE grpd;
grpd: {group: chararray,log: {user: chararray,time: long,query: chararray}}
group and log are two fields in grpd. The field log is a bag with subfields user, time,
and query. Because we haven't yet covered Pig's type system or the GROUP BY operation,
we don't expect you to understand this schema yet. The point is that relations in Pig can
have fairly complex schemas, and DESCRIBE is your friend in understanding the relations
you're working with:
grunt> DESCRIBE cntd;
cntd: {group: chararray,long}
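To make the shapes reported by DESCRIBE more concrete, here is a rough Python analogy of the three relations. It is a sketch only, not how Pig works internally: the sample rows and the helper names group_by_user and count_bags are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows standing in for the relation log: (user, time, query).
log = [
    ("u1", 970916105432, "hadoop"),
    ("u2", 970916105500, "pig latin"),
    ("u1", 970916105612, "mapreduce"),
]


def group_by_user(rows):
    """Pair each user key with the bag (list) of complete rows for it."""
    bags = defaultdict(list)
    for row in rows:
        bags[row[0]].append(row)
    # Each record is (group, bag), the same shape as the schema of grpd.
    return sorted(bags.items())


def count_bags(grouped):
    """Keep each group key alongside the size of its bag, like cntd."""
    return [(group, len(bag)) for group, bag in grouped]


grpd = group_by_user(log)
cntd = count_bags(grpd)
```

In this analogy each element of grpd carries the key plus every original row that matched it, which is why the schema of grpd nests the full schema of log, while cntd keeps only the key and a single number.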
 