Data Processing Operators
Loading and Storing Data
Throughout this chapter, we have seen how to load data from external storage for processing
in Pig. Storing the results is straightforward, too. Here's an example of using PigStorage
to store tuples as plain-text values separated by a colon character:
grunt> STORE A INTO 'out' USING PigStorage(':');
grunt> cat out
Joe:cherry:2
Ali:apple:3
Joe:banana:2
Eve:apple:7
Other built-in storage functions were described in Table 16-7.
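To read the stored file back, the same function can be used with LOAD. The following is a minimal sketch, assuming the out directory written above; the alias C and the field names and types in the AS clause are hypothetical, chosen only for illustration:

```pig
grunt> -- Read back the colon-delimited output; the schema names and types are assumed
grunt> C = LOAD 'out' USING PigStorage(':') AS (name:chararray, fruit:chararray, quantity:int);
grunt> DUMP C;
```

Passing the same delimiter to PigStorage on both STORE and LOAD matters here, because its default delimiter is the tab character.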
Filtering Data
Once you have some data loaded into a relation, often the next step is to filter it to remove
the data that you are not interested in. By filtering early in the processing pipeline, you
minimize the amount of data flowing through the system, which can improve efficiency.
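As a small illustration of filtering early, the following sketch keeps only the rows of A whose third field is greater than 2. It assumes A holds the four tuples shown earlier; the alias name filtered is hypothetical, and the explicit (int) cast is there only because A may have been loaded without a schema:

```pig
grunt> -- Keep only the rows whose third field, read as an int, exceeds 2
grunt> filtered = FILTER A BY (int)$2 > 2;
grunt> DUMP filtered;
```

With the sample data, only the Ali and Eve rows satisfy the condition, so any later operators would work on two tuples instead of four.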
FOREACH...GENERATE
We have already seen how to remove rows from a relation using the FILTER operator with
simple expressions and a UDF. The FOREACH...GENERATE operator is used to act on
every row in a relation. It can be used to remove fields or to generate new ones. In this
example, we do both:
grunt> DUMP A;
(Joe,cherry,2)
(Ali,apple,3)
(Joe,banana,2)
(Eve,apple,7)
grunt> B = FOREACH A GENERATE $0, $2+1, 'Constant';
grunt> DUMP B;
(Joe,3,Constant)
(Ali,4,Constant)
(Joe,3,Constant)
(Eve,8,Constant)
Here we have created a new relation, B, with three fields. Its first field is a projection of the
first field ($0) of A. B's second field is the third field of A ($2) with 1 added to it. B's third