We sorted the output, as the reduce output partitions are unordered (owing to the hash partition function). Doing a bit of postprocessing of data from MapReduce is very common, as is feeding it into analysis tools such as R, a spreadsheet, or even a relational database.
Another way of retrieving the output if it is small is to use the -cat option to print the
output files to the console:
% hadoop fs -cat max-temp/*
On closer inspection, we see that some of the results don't look plausible. For instance,
the maximum temperature for 1951 (not shown here) is 590°C! How do we find out
what's causing this? Is it corrupt input data or a bug in the program?
Debugging a Job
The time-honored way of debugging programs is via print statements, and this is certainly
possible in Hadoop. However, there are complications to consider: with programs running
on tens, hundreds, or thousands of nodes, how do we find and examine the output of the
debug statements, which may be scattered across these nodes? For this particular case,
where we are looking for (what we think is) an unusual case, we can use a debug statement to log to standard error, in conjunction with updating the task's status message to prompt us to look in the error log. The web UI makes this easy, as we will see.
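A minimal, Hadoop-free sketch of such a debug statement follows. The 100°C threshold and the tenths-of-a-degree encoding are assumptions for illustration; in a real mapper you would also call context.setStatus() so the web UI flags the task:

```java
public class DebugSketch {
    // Assumed encoding: temperatures stored in tenths of a degree Celsius,
    // so anything above 1000 (100.0 degrees) is treated as implausible.
    static final int MAX_PLAUSIBLE = 1000;

    // Logs the suspect value to standard error and reports whether it tripped.
    static boolean logIfImplausible(int airTemperature) {
        if (airTemperature > MAX_PLAUSIBLE) {
            System.err.println("Temperature over 100 degrees: " + airTemperature);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(logIfImplausible(5900)); // 590.0 degrees: implausible
        System.out.println(logIfImplausible(311));  // 31.1 degrees: plausible
    }
}
```

Because the message goes to standard error rather than standard output, it ends up in the task's error log rather than mixed into the job's output.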
We also create a custom counter to count the total number of records with implausible
temperatures in the whole dataset. This gives us valuable information about how to deal
with the condition. If it turns out to be a common occurrence, we might need to learn
more about the condition and how to extract the temperature in these cases, rather than
simply dropping the records. In fact, when trying to debug a job, you should always ask yourself whether you can use a counter to get the information you need to find out what's happening. Even if you need to use logging or a status message, it may be useful to use a counter to gauge the extent of the problem. (There is more on counters in Counters.)
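The counting itself can be sketched without a cluster. In a job, the increment goes through the task context (via context.getCounter(...).increment(1)); here a plain tally stands in for the counter, and the threshold is the same illustrative assumption as before:

```java
import java.util.List;

public class CounterSketch {
    // Illustrative threshold: tenths of a degree, so over 1000 means over 100.0 C.
    static final int MAX_PLAUSIBLE = 1000;

    // Local stand-in for a Hadoop counter: tally implausible readings.
    static long countImplausible(List<Integer> temps) {
        long count = 0;
        for (int t : temps) {
            if (t > MAX_PLAUSIBLE) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // 5900 (590.0 C) and 1200 (120.0 C) should trip the counter.
        System.out.println(countImplausible(List.of(311, 5900, -111, 1200)));
    }
}
```

In a real job the framework aggregates these per-task counts across all tasks, so a single number for the whole dataset appears on the job page when the run completes.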
If the amount of log data you produce in the course of debugging is large, you have a
couple of options. One is to write the information to the map's output, rather than to
standard error, for analysis and aggregation by the reduce task. This approach usually necessitates structural changes to your program, so start with the other technique first. The
alternative is to write a program (in MapReduce, of course) to analyze the logs produced
by your job.
We add our debugging to the mapper (version 3), as opposed to the reducer, as we want to
find out what the source data causing the anomalous output looks like:
public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {