The fields are year, number of unique stations, total number of good readings, and total
number of readings. We can see how the number of weather stations and readings grew
over time.
STREAM
The STREAM operator allows you to transform data in a relation using an external program or script. It is named by analogy with Hadoop Streaming, which provides a similar capability for MapReduce (see Hadoop Streaming).
STREAM can use built-in commands with arguments. Here is an example that uses the Unix cut command to extract the second field of each tuple in A. Note that the command and its arguments are enclosed in backticks:
grunt> C = STREAM A THROUGH `cut -f 2`;
grunt> DUMP C;
(cherry)
(apple)
(banana)
(apple)
The STREAM operator uses PigStorage to serialize and deserialize relations to and from the program's standard input and output streams. Tuples in A are converted to tab-delimited lines that are passed to the script. The output of the script is read one line at a time and split on tabs to create new tuples for the output relation C. You can provide a custom serializer and deserializer by subclassing PigStreamingBase (in the org.apache.pig package), then using the DEFINE operator.
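The round trip described above can be sketched in plain Python (the helper names serialize and deserialize are illustrative only; in Pig this work is done by PigStorage, not user code):

```python
def serialize(tup):
    # Each tuple in the relation becomes one tab-delimited line
    # on the streamed command's standard input.
    return "\t".join(str(field) for field in tup) + "\n"

def deserialize(line):
    # Each line the command writes to standard output is split
    # on tabs to form a tuple of the output relation.
    return tuple(line.rstrip("\n").split("\t"))

# A three-field weather tuple going in, a one-field tuple coming back
# (as produced by `cut -f 2` in the example above):
print(serialize(("1950", "0", "1")))   # "1950\t0\t1\n"
print(deserialize("cherry\n"))         # ("cherry",)
```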
Pig streaming is most powerful when you write custom processing scripts. The following
Python script filters out bad weather records:
#!/usr/bin/env python
import re
import sys

for line in sys.stdin:
    (year, temp, q) = line.strip().split()
    if (temp != "9999" and re.match("[01459]", q)):
        print "%s\t%s" % (year, temp)
To use the script, you need to ship it to the cluster. This is achieved via a DEFINE clause,
which also creates an alias for the STREAM command. The STREAM statement can then
refer to the alias, as the following Pig script shows:
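A minimal sketch of such a script follows (the alias is_good_quality, the file path, and the load schema are illustrative assumptions, not fixed by the text):

```
-- Ship the Python script to the cluster and create an alias for it
DEFINE is_good_quality `is_good_quality.py` SHIP('is_good_quality.py');

records = LOAD 'input/ncdc/sample.txt'
    AS (year:chararray, temperature:int, quality:int);

-- STREAM refers to the alias defined above
filtered_records = STREAM records THROUGH is_good_quality
    AS (year:chararray, temperature:int);
```

The SHIP clause copies the script to the cluster so that each task can execute it locally.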