      int count = 0;
      while (values.hasNext()) {
        values.next();
        count++;
      }
      // Emit the per-key stumble count as plain text
      output.collect(key, new Text(Integer.toString(count)));
    }
  }

  public static void main(String[] args) throws IOException {
    if (args.length < 2) {
      System.out.println("Give the name of the by-userid stumble table");
      return;
    }
    // Read rows directly from the HBase table named in args[0]
    JobConf job = new JobConf(CountUserUrlStumbles.class);
    job.setInputFormat(TableInputFormat.class);
    FileInputFormat.setInputPaths(job, args[0]);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    // Write the per-key counts as text files under args[1]
    job.setOutputFormat(TextOutputFormat.class);
    TextOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setNumMapTasks(5000);
    JobClient jc = new JobClient(job);
    jc.submitJob(job);
  }
}
In this example, we look at a routine StumbleUpon task: counting stumbles per
user as well as stumbles per URL. Although the task is neither complex nor
particularly insightful, we include it as a concrete example of the kind of
analytic work we perform daily. The most interesting part is that this trivial
example completes in about one hour on twenty commodity nodes while processing
a key count in the tens of billions. The MySQL-based counterpart does not
complete in a reasonable amount of time, at least not without special handling
and support to dump the data from MySQL, split the lines into reasonably sized
chunks, and then combine the results.
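The listing begins inside the reduce() method, so the mapper and the surrounding class declarations are not shown. The following is a minimal sketch of how they might be framed, assuming the old org.apache.hadoop.hbase.mapred TableMap API (an ImmutableBytesWritable row key and a RowResult per table row) and LongWritable as the intermediate value type; none of these declarations appear in the original listing.

// Sketch only: the excerpt omits these declarations, so the imports and
// type parameters below are assumptions, not StumbleUpon's actual code.
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.mapred.TableInputFormat;
import org.apache.hadoop.hbase.mapred.TableMap;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class CountUserUrlStumbles {

  public static class Map extends MapReduceBase
      implements TableMap<Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    public void map(ImmutableBytesWritable row, RowResult value,
        OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {
      // Assumes one table row per stumble, keyed by userid; a real mapper
      // would extract the userid (or URL) it is counting by from the row.
      output.collect(new Text(row.get()), ONE);
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, Text> {
    public void reduce(Text key, Iterator<LongWritable> values,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      // ... counting loop as shown in the listing above ...
    }
  }

  // ... main() as shown in the listing above ...
}

In this framing, TableInputFormat hands each HBase row to map(), which emits one (key, 1) pair per stumble, and the reducer shown above simply counts the pairs for each key.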
You may find this series of operations familiar: mapping, then reducing! Using
the generalized facilities of HBase and Hadoop, we can run similar statistical
surveys as needed, without special preparation or runtime handling. In
practical terms, we can now complete analysis tasks the same day they are
requested and run ad hoc queries at a rate that was not thought possible before
Hadoop and HBase powered our platform. Because a business thrives or dies on
the data it can analyze, this shorter turnaround time has an enormous impact,
from number crunching in the front office to research engineers doing instant
spam analysis on content submissions. The same skeleton could, for example,
count stumbles per URL rather than per user; a hypothetical variant of the
mapper is sketched below.
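As a concrete illustration of such an ad hoc variation, per-URL counts could come from running the same job over a by-URL table, or from keying the map output on a URL column instead of the row key. The following is a hypothetical sketch of the latter; the column name "stumble:url" and the RowResult/Cell accessors are assumptions about the old HBase client API, not part of the original example (it also needs org.apache.hadoop.hbase.io.Cell and org.apache.hadoop.hbase.util.Bytes).

// Hypothetical per-URL variant of map(); "stumble:url" is an invented
// column name used only for illustration.
public void map(ImmutableBytesWritable row, RowResult value,
    OutputCollector<Text, LongWritable> output, Reporter reporter)
    throws IOException {
  Cell urlCell = value.get(Bytes.toBytes("stumble:url"));
  if (urlCell != null) {
    // Emit (url, 1); the reducer then counts pairs per URL exactly as before
    output.collect(new Text(urlCell.getValue()), new LongWritable(1));
  }
}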
One can only imagine how difficult it would be to refactor a custom processing
pipeline for a data schema more complex than this trivial example if we did
not have our distributed processing platform to power the extraction,
transformation, and analysis.
 