We indeed get the two records as expected. You can verify a few more records to gain
confidence in the correctness of your program's math and logic.¹
One eyesore in the output of this inverted citation graph is that the first line is
not real data:
"CITED" "CITING"
It's an artifact of the first line of the input data, which is a column definition rather
than an actual record. Let's add some code to our mapper to filter out non-numeric
keys and values, and in the process demonstrate regression testing.
REGRESSION TESTING
Our data-centric approach to regression testing revolves around “diff'ing” various
output files from before and after code changes. For our particular change, we should
only be taking out one line from the job's output. To verify that this is indeed the case,
let's first save the output of our current job. In local mode we have at most one
reducer, so the job's output is a single file, which we call job_1_output.
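In a local-mode run, that single reduce output is typically a part-00000 file inside the
job's output directory, so saving the baseline is just a copy. A minimal sketch follows;
the output directory name is only illustrative and should match whatever your run
actually produced:

# Keep a copy of the pre-change reduce output as the regression baseline.
# "output" stands in for the job's actual output directory.
cp output/part-00000 job_1_output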
For regression testing, it's also useful to save the output of the map phase. This will
help us isolate bugs to either the map phase or the reduce phase. We can save the
output of the map phase by running the MapReduce job with zero reducers, which we
can do easily with the -D mapred.reduce.tasks=0 option. In this mapper-only job there
will be multiple files, as each map task writes its output to its own file. Let's copy all
of them into a directory called job_1_intermediate.
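A sketch of that baseline-gathering run is shown below. The jar name, class name, and
paths are placeholders for whatever driver we've been running, and the -D option
assumes the driver accepts generic options (for example, via ToolRunner):

# Run the current (pre-change) job with zero reducers so the raw map output
# is written straight to disk. MyJob.jar, MyJob, and the paths are placeholders.
hadoop jar MyJob.jar MyJob -D mapred.reduce.tasks=0 input output/map_only

# Each map task writes its own part file; gather them all as the baseline.
mkdir output/job_1_intermediate
cp output/map_only/part-* output/job_1_intermediate/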
Having stored away the output files, we can make the desired code changes to the
map() method in MapClass. The code itself is trivial. We focus on testing it.
public void map(Text key, Text value,
                OutputCollector<Text, Text> output,
                Reporter reporter) throws IOException {
    try {
        // Emit only records whose key and value both parse as positive
        // integers, inverting the (citing, cited) pair as before.
        if (Integer.parseInt(key.toString()) > 0 &&
            Integer.parseInt(value.toString()) > 0) {
            output.collect(value, key);
        }
    } catch (NumberFormatException e) {
        // Non-numeric key or value (for example, the header line); skip the record.
    }
}
Compile and execute the new code against the same input data. Let's run it as a map-
only job first and compare the intermediate data. As we've only changed the mapper,
any bug should first manifest in differences in the intermediate data.
diff output/job_1_intermediate/ output/test/
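If that diff turns up nothing beyond the one filtered-out header line, we can rerun the
complete job (reducer included) and compare its final output against the saved
job_1_output baseline; as expected above, exactly one line should disappear. Again with
illustrative paths:

# Compare the new reduce output against the pre-change baseline;
# only the "CITED"/"CITING" header line should be gone.
diff job_1_output output/test_full/part-00000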
¹ In this case, you may wonder whether patent number 1 really is cited by those two patents. The number 1
feels wrong, an outlier in the range of patent numbers being cited. There can be mistakes in the original
input data; we would have to track down the patents themselves to verify this. In any case, ensuring data
quality is an important topic, but it is beyond our discussion of Hadoop.