(1949,111,1)
(1949,78,1)
Pig produces a warning for the invalid field (not shown here) but does not halt its processing. For large datasets, it is very common to have corrupt, invalid, or merely unexpected
data, and it is generally infeasible to incrementally fix every unparsable record. Instead,
we can pull out all of the invalid records in one go so we can take action on them, perhaps
by fixing our program (because they indicate that we have made a mistake) or by filtering
them out (because the data is genuinely unusable):
grunt> corrupt_records = FILTER records BY temperature is null;
grunt> DUMP corrupt_records;
(1950,,1)
Note the use of the is null operator, which is analogous to SQL. In practice, we would
include more information from the original record, such as an identifier and the value that
could not be parsed, to help our analysis of the bad data.
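As a minimal sketch, the corrupt records could also be saved to a file for later inspection with STORE (the output path corrupt_records_out here is illustrative, not from the original):
grunt> STORE corrupt_records INTO 'corrupt_records_out';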
We can find the number of corrupt records using the following idiom for counting the
number of rows in a relation:
grunt> grouped = GROUP corrupt_records ALL;
grunt> all_grouped = FOREACH grouped GENERATE group, COUNT(corrupt_records);
grunt> DUMP all_grouped;
(all,1)
(Grouping and the ALL operation are explained in more detail in the section on GROUP.)
Another useful technique is to use the SPLIT operator to partition the data into “good”
and “bad” relations, which can then be analyzed separately:
grunt> SPLIT records INTO good_records IF temperature is not null,
>> bad_records OTHERWISE;
grunt> DUMP good_records;
(1950,0,1)
(1950,22,1)
(1949,111,1)
(1949,78,1)
grunt> DUMP bad_records;
(1950,,1)
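Once split, each relation can be processed independently; for instance, the counting idiom shown earlier can be applied to the bad relation (the relation names bad_grouped and bad_count are illustrative):
grunt> bad_grouped = GROUP bad_records ALL;
grunt> bad_count = FOREACH bad_grouped GENERATE COUNT(bad_records);
grunt> DUMP bad_count;
(1)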
Going back to the case in which temperature's type was left undeclared, the corrupt data cannot be detected easily, since it doesn't surface as a null:
grunt> records = LOAD 'input/ncdc/micro-tab/sample_corrupt.txt'
>>   AS (year:chararray, temperature, quality:int);