Crunch - Hadoop: The Definitive Guide

Database Reference

In-Depth Information

PGroupedTable<K, V> , which has a combineValues() method for performing

aggregation of all the values for a key, just like a MapReduce reducer:

PTable < String , Integer > maxTemps = yearTemperatures

. groupByKey ()

. combineValues ( Aggregators . MAX_INTS ());

The combineValues() method accepts an instance of a Crunch Aggregator , a

simple interface for expressing any kind of aggregation of a stream of values, and here we

can take advantage of a built-in aggregator from the Aggregators class called

MAX_INTS that finds the maximum value from a set of integers.

The final step in the pipeline is writing the maxTemps table to a file by calling write()

with a text file target object constructed using the To static factory. Crunch actually uses

Hadoop's TextOutputFormat for this operation, which means that the key and value

in each line of output are separated by a tab:

maxTemps . write ( To . textFile ( args [ 1 ]));

The program so far has only been concerned with pipeline construction. To execute a

pipeline, we have to call the done() method, at which point the program blocks until the

pipeline completes. Crunch returns a PipelineResult object that encapsulates vari-

ous statistics about the different jobs that were run in the pipeline, as well as whether the

pipeline succeeded or not. We use the latter information to set the program's exit code ap-

propriately.

When we run the program on the sample dataset, we get the following result:

% hadoop jar crunch-examples.jar crunch.MaxTemperatureCrunch \

input/ncdc/sample.txt output

% cat output/part-r-00000

1949 111

1950 22

Search WWH ::

Custom Search

Home