002BB5A52580A8ED 18
005BD9CD3AC6BB38 18
00A08A54CD03EB95 3
011ACA65C2BF70B2 5
01500FAFE317B7C0 15
0158F8ACC570947D 3
018FBF6BFB213E68 1
Conceptually we've performed an aggregating operation similar to the SQL query:
SELECT user, COUNT(*) FROM excite-small.log GROUP BY user;
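To try the SQL side of this comparison, the sketch below runs the same kind of aggregate with Python's built-in sqlite3 module. The table name log, the helper count_queries_per_user, and the sample rows are all hypothetical stand-ins chosen for illustration; they are not taken from the actual log file. Note that the table's schema has to be declared before any row is inserted.

```python
import sqlite3

# Hypothetical sample rows (user, time, query); not taken from the real log.
ROWS = [
    ("u1", 970916105432, "yahoo chat"),
    ("u1", 970916105500, "yahoo search"),
    ("u2", 970916001949, "weather"),
]

def count_queries_per_user(rows):
    """Run the GROUP BY aggregate on an in-memory table named log."""
    conn = sqlite3.connect(":memory:")
    # The schema is fixed up front, before any data is loaded.
    # "user" is quoted defensively in case it collides with a keyword.
    conn.execute('CREATE TABLE log ("user" TEXT, time INTEGER, query TEXT)')
    conn.executemany("INSERT INTO log VALUES (?, ?, ?)", rows)
    result = conn.execute(
        'SELECT "user", COUNT(*) FROM log GROUP BY "user" ORDER BY "user"'
    ).fetchall()
    conn.close()
    return result

print(count_queries_per_user(ROWS))  # [('u1', 2), ('u2', 1)]
```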
Two main differences between the Pig Latin and SQL versions are worth pointing out.
As we've mentioned earlier, Pig Latin is a data processing language. You're specifying
a series of data processing steps instead of a complex SQL query with clauses. The
other difference is more subtle: relations in SQL always have fixed schemas. In SQL,
we define a relation's schema before it's populated with data. Pig takes a much looser
approach to schema. In fact, you don't need to use schemas if you don't want to,
which may be the case when handling semistructured or unstructured data. Here we
do specify a schema for the relation log, but only in the load statement, and it's
enforced only as the data is loaded. Any field that doesn't obey the schema in the
load operation is cast to a null. In this way the relation log is guaranteed to obey
our stated schema for subsequent operations.
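The cast-to-null behavior can be pictured with a small Python sketch. This is only a model of the idea, not Pig's loader: the helper names cast_or_null and load_record are hypothetical, and the sketch assumes tab-separated input and that missing trailing fields also become nulls.

```python
def cast_or_null(value, caster):
    """Return caster(value), or None when the value is missing or invalid."""
    if value is None:
        return None
    try:
        return caster(value)
    except (TypeError, ValueError):
        return None

def load_record(line):
    """Split one line into (user, time, query) following the stated schema.

    Assumes tab-separated fields; short lines are padded with nulls.
    """
    parts = line.rstrip("\n").split("\t")
    parts += [None] * (3 - len(parts))
    user, time, query = parts[:3]
    return (
        cast_or_null(user, str),   # chararray
        cast_or_null(time, int),   # long
        cast_or_null(query, str),  # chararray
    )

print(load_record("u1\t970916105432\tyahoo chat"))  # ('u1', 970916105432, 'yahoo chat')
print(load_record("u1\tnot-a-number\tyahoo chat"))  # ('u1', None, 'yahoo chat')
```

Every record that comes out of load_record has the same three-field shape, so later steps can rely on the schema even when the raw input was malformed.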
As much as possible, Pig tries to figure out the schema for a relation based on
the operation used to create it. You can expose Pig's schema for any relation with
the DESCRIBE command. This can be useful in understanding what a Pig statement is
doing. For example, we'll look at the schemas for grpd and cntd. Before doing this,
let's first see how the DESCRIBE command describes log.
grunt> DESCRIBE log;
log: {user: chararray,time: long,query: chararray}
As expected, the load command gives log the exact schema we've specified. The
relation log consists of three fields named user, time, and query. The fields user
and query are both strings (chararray in Pig) whereas time is a long integer.
A GROUP BY operation on the relation log generates the relation grpd. Based on
the operation and the schema for log, Pig infers a schema for grpd:
grunt> DESCRIBE grpd;
grpd: {group: chararray,log: {user: chararray,time: long,query: chararray}}
group and log are two fields in grpd. The field log is a bag with subfields user,
time, and query. As we haven't covered Pig's type system and the GROUP BY operation,
we don't expect you to understand this schema yet. The point is that relations in
Pig can have fairly complex schemas, and DESCRIBE is your friend in understanding
the relations you're working with:
grunt> DESCRIBE cntd;
cntd: {group: chararray,long}
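As a rough mental model of these two schemas, the following Python sketch mimics the shape of the rows in grpd and cntd. It is an analogy only: the function names group_by_user and count_per_group and the sample records are hypothetical, and a Python list stands in for a Pig bag.

```python
from collections import defaultdict

# Hypothetical (user, time, query) records standing in for the relation log.
LOG = [
    ("u1", 970916105432, "yahoo chat"),
    ("u2", 970916001949, "weather"),
    ("u1", 970916105500, "yahoo search"),
]

def group_by_user(log):
    """Mimic grpd: one (group, bag) row per user; the bag holds full log tuples."""
    bags = defaultdict(list)
    for record in log:
        bags[record[0]].append(record)
    return sorted(bags.items())

def count_per_group(grpd):
    """Mimic cntd: one (group, count) row per group; the count has no field name."""
    return [(group, len(bag)) for group, bag in grpd]

grpd = group_by_user(LOG)
cntd = count_per_group(grpd)
print(grpd[0])  # ('u1', [('u1', 970916105432, 'yahoo chat'), ('u1', 970916105500, 'yahoo search')])
print(cntd)     # [('u1', 2), ('u2', 1)]
```

Each row of grpd keeps the complete original tuples inside its bag, which is why the schema for grpd nests the whole schema of log, while cntd reduces each bag to a single long.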