The fields are year, number of unique stations, total number of good readings, and total
number of readings. We can see how the number of weather stations and readings grew
over time.
The STREAM operator allows you to transform data in a relation using an external program or script. It is named by analogy with Hadoop Streaming, which provides a similar capability for MapReduce (see Hadoop Streaming).
STREAM can use built-in commands with arguments. Here is an example that uses the Unix cut command to extract the second field of each tuple in A. Note that the command and its arguments are enclosed in backticks:

grunt> C = STREAM A THROUGH `cut -f 2`;
grunt> DUMP C;
(cherry)
(apple)
(banana)
(apple)
The STREAM operator uses PigStorage to serialize and deserialize relations to and from the program's standard input and output streams. Tuples in A are converted to tab-delimited lines that are passed to the script. The output of the script is read one line at a time and split on tabs to create new tuples for the output relation C. You can provide a custom serializer and deserializer by subclassing PigStreamingBase (in the org.apache.pig package), then using the DEFINE operator.
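The serialize-pipe-deserialize round trip that STREAM performs can be sketched outside Pig. The snippet below is an illustration, not Pig itself: the first field of each tuple in A is invented for the example, and Python's subprocess module stands in for Pig's process management. It tab-joins each tuple (as PigStorage does), pipes the lines through the same `cut -f 2` command, and splits the output back into tuples:

```python
import subprocess

# Sample relation A: (id, fruit) tuples; the id values are hypothetical.
A = [(2, "cherry"), (1, "apple"), (3, "banana"), (1, "apple")]

# Serialize: each tuple becomes a tab-delimited line, as PigStorage does.
stdin_data = "".join("%s\t%s\n" % t for t in A)

# Run the external command, feeding it the serialized relation.
out = subprocess.run(["cut", "-f", "2"], input=stdin_data,
                     capture_output=True, text=True).stdout

# Deserialize: each output line is split on tabs to form a tuple of C.
C = [tuple(line.split("\t")) for line in out.splitlines()]
print(C)
```

Running this prints the same four single-field tuples that DUMP C shows above.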
Pig streaming is most powerful when you write custom processing scripts. The following
Python script filters out bad weather records:
#!/usr/bin/env python
import re
import sys

for line in sys.stdin:
    (year, temp, q) = line.strip().split()
    if temp != "9999" and re.match("[01459]", q):
        print("%s\t%s" % (year, temp))
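Because the script is plain line-at-a-time filtering, its logic is easy to check locally before shipping it to the cluster. The sketch below (with made-up sample records) applies the same test, dropping readings whose temperature is the missing value 9999 or whose quality code is not one of 0, 1, 4, 5, or 9:

```python
import re

def keep(line):
    # Same condition as the streaming script: reject missing
    # temperatures and bad quality codes.
    (year, temp, q) = line.strip().split()
    return temp != "9999" and re.match("[01459]", q) is not None

# Hypothetical records: year, temperature, quality code.
records = ["1950\t0\t1", "1950\t9999\t1", "1949\t111\t2"]
kept = [line for line in records if keep(line)]
print(kept)
```

Only the first record survives: the second has a missing temperature and the third a bad quality code.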
To use the script, you need to ship it to the cluster. This is achieved via a DEFINE clause, which also creates an alias for the STREAM command. The STREAM statement can then refer to the alias, as the following Pig script shows: