Figure 6-3. Screenshot of the tasks page
Clicking on the task link takes us to the task attempts page, which shows each task
attempt for the task. Each task attempt page has links to the logfiles and counters. If we
follow one of the links to the logfiles for the successful task attempt, we can find the
suspect input record that we logged (the line is wrapped and truncated to fit on the page):
Temperature over 100 degrees for input:
0335999999433181957042302005+37950+139117SAO +0004RJSN
V02011359003150070356999
9994332 01957 010100005+35317+139650SAO
+000899999V02002359002650076249N0040005...
This record seems to be in a different format from the others. For one thing, there are
spaces in the line, which are not described in the specification.
When the job has finished, we can look at the value of the counter we defined to see how
many records over 100°C there are in the whole dataset. Counters are accessible via the
web UI or the command line:
% mapred job -counter job_1410450250506_0006 \
'v3.MaxTemperatureMapper$Temperature' OVER_100
3
The -counter option takes the job ID, counter group name (which is the fully qualified
classname here), and counter name (the enum name). There are only three malformed
records in the entire dataset of over a billion records. Throwing out bad records is standard
for many big data problems, although we need to be careful in this case because we are
looking for an extreme value (the maximum temperature) rather than an aggregate
measure. Still, throwing away three records is probably not going to change the result.
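
The group name passed to -counter contains a $ because the counter enum is nested inside the mapper class, which is also why the argument is quoted on the command line. The following is a minimal sketch in plain Java, with no Hadoop dependency, of how such a group name and counter name can be derived from an enum constant. The class CounterGroupNameSketch and its helper methods are hypothetical, and the nested MaxTemperatureMapper is only a stand-in for the real mapper, which lives in the v3 package.

```java
public class CounterGroupNameSketch {

    // Hypothetical stand-in for the mapper class; the real one is in package v3.
    public static class MaxTemperatureMapper {
        public enum Temperature { OVER_100 }
    }

    // The group name is the binary name of the enum's declaring class.
    // Nested classes are separated from their enclosing class by '$'.
    public static String groupName(Enum<?> counter) {
        return counter.getDeclaringClass().getName();
    }

    // The counter name is simply the name of the enum constant.
    public static String counterName(Enum<?> counter) {
        return counter.name();
    }

    public static void main(String[] args) {
        Enum<?> counter = MaxTemperatureMapper.Temperature.OVER_100;
        System.out.println(groupName(counter));
        System.out.println(counterName(counter));
    }
}
```

Because the stand-in mapper is nested inside the sketch class, the printed group name carries an extra enclosing-class prefix. With the enum declared in a top-level v3.MaxTemperatureMapper class, the same call would yield the group name used in the command above.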