Counters
There are often things that you would like to know about the data you are analyzing but
that are peripheral to the analysis you are performing. For example, if you were counting
invalid records and discovered that the proportion of invalid records in the whole dataset
was very high, you might be prompted to check why so many records were being marked
as invalid — perhaps there is a bug in the part of the program that detects invalid records?
Or if the data was of poor quality and genuinely did have very many invalid records, after
discovering this, you might decide to increase the size of the dataset so that the number of
good records was large enough for meaningful analysis.
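The quality-control check described above amounts to simple arithmetic on two counter values. The following self-contained sketch illustrates the idea with hypothetical totals (the numbers and the 10% threshold are assumptions for illustration only; in a real job these figures would come from counters):

```java
public class InvalidRecordCheck {
    public static void main(String[] args) {
        // Hypothetical totals, standing in for values read from job counters
        long totalRecords   = 1_000_000L;
        long invalidRecords = 350_000L;

        // Proportion of the whole dataset that was marked invalid
        double invalidFraction = (double) invalidRecords / totalRecords;

        // A high proportion is a prompt to investigate: either a bug in the
        // invalid-record detector, or genuinely poor-quality data that calls
        // for a larger dataset. The 10% cutoff here is an arbitrary example.
        if (invalidFraction > 0.10) {
            System.out.println("warning: high invalid-record rate");
        }
    }
}
```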
Counters are a useful channel for gathering statistics about the job: for quality control or
for application-level statistics. They are also useful for problem diagnosis. If you are tempted to put a log message into your map or reduce task, it is often better to see whether you
can use a counter instead to record that a particular condition occurred. In addition to
counter values being much easier to retrieve than log output for large distributed jobs, you
get a record of the number of times that condition occurred, which is more work to obtain
from a set of logfiles.
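To make the counting pattern concrete, here is a self-contained sketch of what a map task does when it counts a condition rather than logging it. Plain Java stands in for the framework: the validity rule, counter names, and input records are illustrative assumptions, and the `Map<String, Long>` mimics what Hadoop's counter mechanism aggregates for you across tasks.

```java
import java.util.HashMap;
import java.util.Map;

public class CounterSketch {
    // Hypothetical validity check, for illustration only: a record is
    // "valid" if it parses as an integer.
    static boolean isValid(String record) {
        try {
            Integer.parseInt(record.trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Stand-in for the framework's counters; in a real map task you
        // would increment a named counter each time the condition occurs,
        // instead of writing a log line per record.
        Map<String, Long> counters = new HashMap<>();

        String[] input = {"12", "oops", "7", "", "42"};
        for (String record : input) {
            String name = isValid(record) ? "VALID_RECORDS" : "INVALID_RECORDS";
            counters.merge(name, 1L, Long::sum);
        }

        System.out.println(counters.get("VALID_RECORDS") + " valid, "
                + counters.get("INVALID_RECORDS") + " invalid");
    }
}
```

The payoff is exactly the one described above: instead of grepping distributed logfiles to count occurrences, the totals arrive ready-made when the job finishes.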
Built-in Counters
Hadoop maintains some built-in counters for every job, and these report various metrics.
For example, there are counters for the number of bytes and records processed, which allow you to confirm that the expected amount of input was consumed and the expected
amount of output was produced.
Counters are divided into groups, and there are several groups for the built-in counters, listed in Table 9-1.