per and reducer via the Hadoop job configuration. (Techniques for achieving this are
discussed in Side Data Distribution.)
There are a couple of differences from the regular Hadoop MapReduce API. The first is
the use of wrappers around Avro Java types. For this MapReduce program, the key is the
year (an integer), and the value is the weather record, which is represented by Avro's
GenericRecord. This translates to AvroKey<Integer> for the key type and
AvroValue<GenericRecord> for the value type in the map output (and reduce input).
The MaxTemperatureReducer iterates through the records for each key (year) and
finds the one with the maximum temperature. It is necessary to make a copy of the record
with the highest temperature found so far, since the iterator reuses the instance for reasons
of efficiency (and only the fields are updated).
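The need for the copy can be illustrated without Avro or Hadoop at all. The following is a minimal sketch, not the book's MaxTemperatureReducer: Reading, ReuseSketch, reusingValues, and the sample temperatures and station IDs are hypothetical names and data invented for illustration, and the iterator only imitates the instance-reuse behavior described above.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a weather record; this is NOT Avro's GenericRecord.
final class Reading {
    int year;
    int temperature;
    String stationId;

    Reading() {}

    // Copy constructor: snapshots the current field values.
    Reading(Reading other) {
        this.year = other.year;
        this.temperature = other.temperature;
        this.stationId = other.stationId;
    }
}

final class ReuseSketch {
    // Imitates a reducer's value iterator: every call to next() returns the
    // same Reading instance with its fields overwritten.
    static Iterable<Reading> reusingValues(int year, int[] temps, String[] stations) {
        return () -> new Iterator<Reading>() {
            private final Reading shared = new Reading();
            private int index = 0;

            @Override
            public boolean hasNext() {
                return index < temps.length;
            }

            @Override
            public Reading next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                shared.year = year;
                shared.temperature = temps[index];
                shared.stationId = stations[index];
                index++;
                return shared;
            }
        };
    }

    // Buggy variant: keeps a reference to the shared instance, so the "max"
    // silently changes whenever the iterator advances.
    static Reading maxKeepingReference(Iterable<Reading> values) {
        Reading max = null;
        for (Reading r : values) {
            if (max == null || r.temperature > max.temperature) {
                max = r;
            }
        }
        return max;
    }

    // Correct variant: copies the record whenever a new maximum is found.
    static Reading maxCopying(Iterable<Reading> values) {
        Reading max = null;
        for (Reading r : values) {
            if (max == null || r.temperature > max.temperature) {
                max = new Reading(r);
            }
        }
        return max;
    }

    public static void main(String[] args) {
        int[] temps = {78, 111, 22};          // made-up sample values
        String[] stations = {"A", "B", "C"};  // made-up station IDs
        Reading kept = maxKeepingReference(reusingValues(1949, temps, stations));
        Reading copied = maxCopying(reusingValues(1949, temps, stations));
        System.out.println("without copy: " + kept.temperature + " from " + kept.stationId);
        System.out.println("with copy:    " + copied.temperature + " from " + copied.stationId);
    }
}
```

Without the copy, the result is whatever record the iterator produced last (22 from station C here), because the comparison is always between the shared instance and itself. With the copy, the maximum (111 from station B) survives. In an Avro reducer the copy would be made with Avro's own record-copying facilities rather than a hand-written copy constructor.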
The second major difference from regular MapReduce is the use of AvroJob for
configuring the job. AvroJob is a convenience class for specifying the Avro schemas for the
input, map output, and final output data. In this program, no input schema is set, because
we are reading from a text file. The map output key schema is an Avro int and the value
schema is the weather record schema. The final output key schema is the weather record
schema, and the output format is AvroKeyOutputFormat, which writes keys to Avro
datafiles and ignores the values (which are NullWritable).
The following commands show how to run the program on a small sample dataset:
% export HADOOP_CLASSPATH=avro-examples.jar
% export HADOOP_USER_CLASSPATH_FIRST=true # override version of Avro in Hadoop
% hadoop jar avro-examples.jar AvroGenericMaxTemperature \
input/ncdc/sample.txt output
On completion, we can look at the output using the Avro tools JAR to render the Avro
datafile as JSON, one record per line:
% java -jar $AVRO_HOME/avro-tools-*.jar tojson output/part-r-00000.avro
{"year":1949,"temperature":111,"stationId":"012650-99999"}
{"year":1950,"temperature":22,"stationId":"011990-99999"}
In this example we read a text file and created an Avro datafile, but other combinations
of input and output formats are possible, which is useful for converting between Avro and
other formats (such as SequenceFiles). See the documentation for the Avro MapReduce
package for details.