We sorted the output, as the reduce output partitions are unordered (owing to the hash partition function). Doing a bit of postprocessing of data from MapReduce is very common, as is feeding it into analysis tools such as R, a spreadsheet, or even a relational database.
Another way of retrieving the output if it is small is to use the -cat option to print the
output files to the console:
% hadoop fs -cat max-temp/*
On closer inspection, we see that some of the results don't look plausible. For instance,
the maximum temperature for 1951 (not shown here) is 590°C! How do we find out
what's causing this? Is it corrupt input data or a bug in the program?
Debugging a Job
The time-honored way of debugging programs is via print statements, and this is certainly
possible in Hadoop. However, there are complications to consider: with programs running
on tens, hundreds, or thousands of nodes, how do we find and examine the output of the
debug statements, which may be scattered across these nodes? For this particular case,
where we are looking for (what we think is) an unusual case, we can use a debug statement to log to standard error, in conjunction with updating the task's status message to prompt us to look in the error log. The web UI makes this easy, as we will see.
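A minimal, Hadoop-free sketch of such a debug statement follows. The 100°C threshold and the tenths-of-a-degree encoding are assumptions for illustration; in a real mapper you would also call context.setStatus() so the web UI flags the task:

```java
public class DebugSketch {
    // Assumed encoding: temperatures stored in tenths of a degree Celsius,
    // so anything above 1000 (100.0 degrees) is treated as implausible.
    static final int MAX_PLAUSIBLE = 1000;

    // Logs the suspect value to standard error and reports whether it tripped.
    static boolean logIfImplausible(int airTemperature) {
        if (airTemperature > MAX_PLAUSIBLE) {
            System.err.println("Temperature over 100 degrees: " + airTemperature);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(logIfImplausible(5900)); // 590.0 degrees: implausible
        System.out.println(logIfImplausible(311));  // 31.1 degrees: plausible
    }
}
```

Because the message goes to standard error rather than standard output, it ends up in the task's error log rather than mixed into the job's output.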
We also create a custom counter to count the total number of records with implausible
temperatures in the whole dataset. This gives us valuable information about how to deal
with the condition. If it turns out to be a common occurrence, we might need to learn
more about the condition and how to extract the temperature in these cases, rather than
simply dropping the records. In fact, when trying to debug a job, you should always ask yourself whether you can use a counter to get the information you need to find out what's happening. Even if you need to use logging or a status message, it may be useful to use a counter to gauge the extent of the problem. (There is more on counters in Counters.)
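The counting itself can be sketched without a cluster. In a job, the increment goes through the task context (via context.getCounter(...).increment(1)); here a plain tally stands in for the counter, and the threshold is the same illustrative assumption as before:

```java
import java.util.List;

public class CounterSketch {
    // Illustrative threshold: tenths of a degree, so over 1000 means over 100.0 C.
    static final int MAX_PLAUSIBLE = 1000;

    // Local stand-in for a Hadoop counter: tally implausible readings.
    static long countImplausible(List<Integer> temps) {
        long count = 0;
        for (int t : temps) {
            if (t > MAX_PLAUSIBLE) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // 5900 (590.0 C) and 1200 (120.0 C) should trip the counter.
        System.out.println(countImplausible(List.of(311, 5900, -111, 1200)));
    }
}
```

In a real job the framework aggregates these per-task counts across all tasks, so a single number for the whole dataset appears on the job page when the run completes.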
If the amount of log data you produce in the course of debugging is large, you have a
couple of options. One is to write the information to the map's output, rather than to
standard error, for analysis and aggregation by the reduce task. This approach usually necessitates structural changes to your program, so start with the other technique first. The
alternative is to write a program (in MapReduce, of course) to analyze the logs produced
by your job.
We add our debugging to the mapper (version 3), as opposed to the reducer, as we want to
find out what the source data causing the anomalous output looks like:
public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {