      int count = 0;
      while (values.hasNext()) {
        values.next();
        count++;
      }
      // Emit the per-key stumble count as plain text
      output.collect(key, new Text(Integer.toString(count)));
    }
  }

  public static void main(String[] args) throws IOException {
    if (args.length < 2) {
      System.out.println("Give the name of the by-userid stumble table");
      return;
    }
    // Read rows directly from the HBase table named in args[0]
    JobConf job = new JobConf(CountUserUrlStumbles.class);
    job.setInputFormat(TableInputFormat.class);
    FileInputFormat.setInputPaths(job, args[0]);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    // Write the per-key counts as text files under args[1]
    job.setOutputFormat(TextOutputFormat.class);
    TextOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setNumMapTasks(5000);
    JobClient jc = new JobClient(job);
    jc.submitJob(job);
  }
}
In this example, we look at a routine StumbleUpon task: counting stumbles per
user as well as stumbles per URL. Although the task is neither complex nor
particularly insightful, we include it as a concrete example of the kind of
analytic work we perform daily. The most interesting part is that this trivial
example completes in about one hour on twenty commodity nodes while processing
a key count in the tens of billions. The MySQL-based counterpart does not
complete in a reasonable amount of time, at least not without special handling
and support to dump the data from MySQL, split the lines into reasonably sized
chunks, and then combine the results.
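The listing begins inside the reduce() method, so the mapper and the surrounding class declarations are not shown. The following is a minimal sketch of how they might be framed, assuming the old org.apache.hadoop.hbase.mapred TableMap API (an ImmutableBytesWritable row key and a RowResult per table row) and LongWritable as the intermediate value type; none of these declarations appear in the original listing.

// Sketch only: the excerpt omits these declarations, so the imports and
// type parameters below are assumptions, not StumbleUpon's actual code.
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.mapred.TableInputFormat;
import org.apache.hadoop.hbase.mapred.TableMap;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class CountUserUrlStumbles {

  public static class Map extends MapReduceBase
      implements TableMap<Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    public void map(ImmutableBytesWritable row, RowResult value,
        OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {
      // Assumes one table row per stumble, keyed by userid; a real mapper
      // would extract the userid (or URL) it is counting by from the row.
      output.collect(new Text(row.get()), ONE);
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, Text> {
    public void reduce(Text key, Iterator<LongWritable> values,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      // ... counting loop as shown in the listing above ...
    }
  }

  // ... main() as shown in the listing above ...
}

In this framing, TableInputFormat hands each HBase row to map(), which emits one (key, 1) pair per stumble, and the reducer shown above simply counts the pairs for each key.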
You may find this series of operations familiar: mapping, then reducing! Using
the generalized facilities of HBase and Hadoop, we can run similar statistical
surveys as needed, without special preparation or runtime handling. In
practical terms, we can now complete analysis tasks the same day they are
requested and run ad hoc queries at a rate that was not thought possible before
Hadoop and HBase powered our platform. Because a business thrives or dies on
the data it can analyze, this shorter turnaround time has an enormous impact,
from number crunching in the front office to research engineers doing instant
spam analysis on content submissions. The same skeleton could, for example,
count stumbles per URL rather than per user; a hypothetical variant of the
mapper is sketched below.
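As a concrete illustration of such an ad hoc variation, per-URL counts could come from running the same job over a by-URL table, or from keying the map output on a URL column instead of the row key. The following is a hypothetical sketch of the latter; the column name "stumble:url" and the RowResult/Cell accessors are assumptions about the old HBase client API, not part of the original example (it also needs org.apache.hadoop.hbase.io.Cell and org.apache.hadoop.hbase.util.Bytes).

// Hypothetical per-URL variant of map(); "stumble:url" is an invented
// column name used only for illustration.
public void map(ImmutableBytesWritable row, RowResult value,
    OutputCollector<Text, LongWritable> output, Reporter reporter)
    throws IOException {
  Cell urlCell = value.get(Bytes.toBytes("stumble:url"));
  if (urlCell != null) {
    // Emit (url, 1); the reducer then counts pairs per URL exactly as before
    output.collect(new Text(urlCell.getValue()), new LongWritable(1));
  }
}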
One can only imagine how difficult it would be to refactor a custom processing
pipeline for a data schema more complex than this trivial example if we did
not have our distributed processing platform to power the extraction,
transformation, and analysis.
 