Pig - Hadoop: The Definitive Guide

Data Processing Operators

Loading and Storing Data

Throughout this chapter, we have seen how to load data from external storage for processing in Pig. Storing the results is straightforward, too. Here's an example of using PigStorage to store tuples as plain-text values separated by a colon character:

grunt> STORE A INTO 'out' USING PigStorage(':');
grunt> cat out
Joe:cherry:2
Ali:apple:3
Joe:banana:2
Eve:apple:7

Other built-in storage functions were described in Table 16-7.
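Because PigStorage works as both a load and a store function, the colon-delimited output can be read back with the same delimiter. The following is a minimal sketch only: the relation name C and the field names and types in the AS clause are assumptions made for illustration, since the text does not declare a schema for A.

```pig
-- Read back the colon-delimited files written by the STORE statement.
-- The schema (name, fruit, quantity) is an assumed one, chosen to match
-- the shape of the sample rows; adjust it to the real data.
C = LOAD 'out' USING PigStorage(':')
    AS (name:chararray, fruit:chararray, quantity:int);

-- Print the tuples to the console to check the round trip.
DUMP C;
```

Using the same delimiter on both sides matters: loading this output with the default tab delimiter would leave each line as a single field.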

Filtering Data

Once you have some data loaded into a relation, often the next step is to filter it to remove the data that you are not interested in. By filtering early in the processing pipeline, you minimize the amount of data flowing through the system, which can improve efficiency.
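As a small illustration of filtering early, the sketch below applies FILTER immediately after loading, so that any later operators see only the rows of interest. The relation name F, the field names, and the threshold of 2 are assumptions for illustration and do not come from the text.

```pig
-- Load the colon-delimited sample data with an assumed schema.
A = LOAD 'out' USING PigStorage(':')
    AS (name:chararray, fruit:chararray, quantity:int);

-- Keep only rows whose quantity exceeds 2. With the sample data this
-- retains the Ali and Eve rows (quantities 3 and 7) and drops the two
-- Joe rows (quantity 2).
F = FILTER A BY quantity > 2;

DUMP F;
```

Any grouping or joining done afterward would then operate on F rather than on the full relation A.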

FOREACH...GENERATE

We have already seen how to remove rows from a relation using the FILTER operator with simple expressions and a UDF. The FOREACH...GENERATE operator is used to act on every row in a relation. It can be used to remove fields or to generate new ones. In this example, we do both:

grunt> DUMP A;
(Joe,cherry,2)
(Ali,apple,3)
(Joe,banana,2)
(Eve,apple,7)
grunt> B = FOREACH A GENERATE $0, $2+1, 'Constant';
grunt> DUMP B;
(Joe,3,Constant)
(Ali,4,Constant)
(Joe,3,Constant)
(Eve,8,Constant)

Here we have created a new relation, B, with three fields. Its first field is a projection of the first field ($0) of A. B's second field is the third field of A ($2) with 1 added to it. B's third field is the constant value 'Constant'.
