(1949,111,1)
(1949,78,1)
Pig produces a warning for the invalid field (not shown here) but does not halt its processing. For large datasets, it is very common to have corrupt, invalid, or merely unexpected
data, and it is generally infeasible to incrementally fix every unparsable record. Instead,
we can pull out all of the invalid records in one go so we can take action on them, perhaps
by fixing our program (because they indicate that we have made a mistake) or by filtering
them out (because the data is genuinely unusable):
grunt> corrupt_records = FILTER records BY temperature is null;
grunt> DUMP corrupt_records;
(1950,,1)
Note the use of the is null operator, which is analogous to SQL. In practice, we would
include more information from the original record, such as an identifier and the value that
could not be parsed, to help our analysis of the bad data.
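As a minimal sketch, the corrupt records could also be saved to a file for later inspection with STORE (the output path corrupt_records_out here is illustrative, not from the original):
grunt> STORE corrupt_records INTO 'corrupt_records_out';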
We can find the number of corrupt records using the following idiom for counting the
number of rows in a relation:
grunt> grouped = GROUP corrupt_records ALL;
grunt> all_grouped = FOREACH grouped GENERATE group, COUNT(corrupt_records);
grunt> DUMP all_grouped;
(all,1)
(Grouping and the ALL operation are explained in more detail in the section on GROUP.)
Another useful technique is to use the SPLIT operator to partition the data into “good”
and “bad” relations, which can then be analyzed separately:
grunt> SPLIT records INTO good_records IF temperature is not null,
>> bad_records OTHERWISE;
grunt> DUMP good_records;
(1950,0,1)
(1950,22,1)
(1949,111,1)
(1949,78,1)
grunt> DUMP bad_records;
(1950,,1)
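Once split, each relation can be processed independently; for instance, the counting idiom shown earlier can be applied to the bad relation (the relation names bad_grouped and bad_count are illustrative):
grunt> bad_grouped = GROUP bad_records ALL;
grunt> bad_count = FOREACH bad_grouped GENERATE COUNT(bad_records);
grunt> DUMP bad_count;
(1)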
Going back to the case in which temperature's type was left undeclared, the corrupt data cannot be detected easily, since it doesn't surface as a null:
grunt> records = LOAD 'input/ncdc/micro-tab/sample_corrupt.txt'
>>   AS (year:chararray, temperature, quality:int);