We indeed get the two records as expected. You can verify a few more records to gain
confidence in the correctness of your program's math and logic.¹
One eyesore in the output of this inverted citation graph is that the first line is
not real data:
"CITED" "CITING"
It's an artifact of the first line of the input data, which is a column definition rather
than an actual record. Let's add some code to our mapper to filter out non-numeric
keys and values, and in the process demonstrate regression testing.
REGRESSION TESTING
Our data-centric approach to regression testing revolves around “diff'ing” various
output files from before and after code changes. For our particular change, we should
only be taking out one line from the job's output. To verify that this is indeed the case,
let's first save the output of our current job. In local mode we have at most one
reducer, so the job's output is a single file, which we call job_1_output.
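In a local-mode run, that single reduce output is typically a part-00000 file inside the
job's output directory, so saving the baseline is just a copy. A minimal sketch follows;
the output directory name is only illustrative and should match whatever your run
actually produced:

# Keep a copy of the pre-change reduce output as the regression baseline.
# "output" stands in for the job's actual output directory.
cp output/part-00000 job_1_output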
For regression testing, it's also useful to save the output of the map phase. This will
help us isolate bugs to either the map phase or the reduce phase. We can save the
output of the map phase by running the MapReduce job with zero reducers, which we
can do easily with the -D mapred.reduce.tasks=0 option. In this mapper-only job there
will be multiple files, as each map task writes its output to its own file. Let's copy all
of them into a directory called job_1_intermediate.
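A sketch of that baseline-gathering run is shown below. The jar name, class name, and
paths are placeholders for whatever driver we've been running, and the -D option
assumes the driver accepts generic options (for example, via ToolRunner):

# Run the current (pre-change) job with zero reducers so the raw map output
# is written straight to disk. MyJob.jar, MyJob, and the paths are placeholders.
hadoop jar MyJob.jar MyJob -D mapred.reduce.tasks=0 input output/map_only

# Each map task writes its own part file; gather them all as the baseline.
mkdir output/job_1_intermediate
cp output/map_only/part-* output/job_1_intermediate/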
Having stored away the output files, we can make the desired code changes to the
map() method in MapClass. The code itself is trivial. We focus on testing it.
public void map(Text key, Text value,
                OutputCollector<Text, Text> output,
                Reporter reporter) throws IOException {
    try {
        // Emit only records whose key and value both parse as positive
        // integers, inverting the (citing, cited) pair as before.
        if (Integer.parseInt(key.toString()) > 0 &&
            Integer.parseInt(value.toString()) > 0) {
            output.collect(value, key);
        }
    } catch (NumberFormatException e) {
        // Non-numeric key or value (for example, the header line); skip the record.
    }
}
Compile and execute the new code against the same input data. Let's run it as a map-
only job first and compare the intermediate data. As we've only changed the mapper,
any bug should first manifest in differences in the intermediate data.
diff output/job_1_intermediate/ output/test/
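If that diff turns up nothing beyond the one filtered-out header line, we can rerun the
complete job (reducer included) and compare its final output against the saved
job_1_output baseline; as expected above, exactly one line should disappear. Again with
illustrative paths:

# Compare the new reduce output against the pre-change baseline;
# only the "CITED"/"CITING" header line should be gone.
diff job_1_output output/test_full/part-00000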
¹ In this case, you may wonder whether patent number 1 really is cited by those two patents. The number 1
feels wrong, an outlier in the range of patent numbers being cited. There can be mistakes in the original
input data; we would have to track down the patents themselves to verify this. In any case, ensuring data
quality is an important topic, but it is beyond our discussion of Hadoop.