chapter 1 once more. If the job processes a document containing the word “the” 574
times, it's much more efficient to store and shuffle the pair (“the”, 574) once instead
of the pair (“the”, 1) multiple times. This processing step is known as combining. We
explain combiners in more depth in section 4.6.
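Section 4.6 covers combiners properly, but a minimal sketch shows the idea now (the class name here is ours, not one of this book's listings; it assumes the old org.apache.hadoop.mapred API used throughout this chapter). A combiner is simply a Reducer that Hadoop runs on each map task's local output before the shuffle, collapsing many ("the", 1) pairs into a single ("the", n) pair:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// A combiner is just a Reducer applied to each map task's local output,
// so a driver would register it with conf.setCombinerClass(...).
public class WordCountCombiner extends MapReduceBase
        implements Reducer<Text, LongWritable, Text, LongWritable> {

    public void reduce(Text key, Iterator<LongWritable> values,
                       OutputCollector<Text, LongWritable> output,
                       Reporter reporter) throws IOException {
        long sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();             // add up the local counts
        }
        output.collect(key, new LongWritable(sum)); // emit one pair per word
    }
}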
3.2.6 Word counting with predefined mapper and reducer classes
We have concluded our preliminary coverage of all the basic components of MapReduce.
Now that you've seen more classes provided by Hadoop, it'll be fun to revisit the WordCount example (see listing 3.3), using some of the classes we've learned.
Listing 3.3 Revised version of the WordCount example
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;
import org.apache.hadoop.mapred.lib.TokenCountMapper;

public class WordCount2 {
    public static void main(String[] args) {
        JobClient client = new JobClient();
        JobConf conf = new JobConf(WordCount2.class);
        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);
        conf.setMapperClass(TokenCountMapper.class);  // Hadoop's own TokenCountMapper
        conf.setCombinerClass(LongSumReducer.class);
        conf.setReducerClass(LongSumReducer.class);   // Hadoop's own LongSumReducer
        client.setConf(conf);
        try {
            JobClient.runJob(conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
We have to write only the driver for this MapReduce program, because we've used Hadoop's predefined TokenCountMapper and LongSumReducer classes (noted in the comments in listing 3.3). Easy, isn't it? Hadoop provides the ability to build more sophisticated programs (this will be the focus of part 2 of this book), but we want to emphasize that Hadoop allows you to rapidly generate useful programs with a minimal amount of code.
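To drive the point home, here's a hedged sketch of our own (not one of this book's listings): to count matches of a regular expression instead of words, you can swap in another of Hadoop's predefined classes, org.apache.hadoop.mapred.lib.RegexMapper, and set a single configuration property; everything else works as in listing 3.3. The class name DigitCount and the pattern are our illustrative choices.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;
import org.apache.hadoop.mapred.lib.RegexMapper;

// Hypothetical driver (name is ours): count runs of digits in the input.
public class DigitCount {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(DigitCount.class);
        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);
        conf.setMapperClass(RegexMapper.class);
        conf.set("mapred.mapper.regex", "[0-9]+");  // pattern whose matches we count
        conf.setCombinerClass(LongSumReducer.class);
        conf.setReducerClass(LongSumReducer.class);
        JobClient.runJob(conf);
    }
}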
3.3 Reading and writing
Let's look at how MapReduce reads input data and writes output data, focusing on the file formats it uses. To enable easy distributed processing, MapReduce makes certain assumptions about the data it's processing. It also provides flexibility in dealing with a variety of data formats.
Input data usually resides in large files, typically tens or hundreds of gigabytes or even more. One of the fundamental principles of MapReduce's processing power is the splitting of the input data into chunks. You can process these chunks in parallel using multiple machines. In Hadoop terminology these chunks are called input splits.
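As a rough sketch of our own (assuming the old mapred API used in this chapter, where the minimum split size is controlled by the mapred.min.split.size property; Hadoop still chooses the actual split boundaries), a driver can influence how coarse those splits are:

// Fragment of a driver such as listing 3.3 (sketch only): raising the
// minimum split size yields fewer, larger splits and thus fewer map tasks.
conf.setLong("mapred.min.split.size", 128L * 1024 * 1024); // 128 MB floor
// Each split is consumed by one map task, ideally on a machine that
// already stores that chunk of the file locally.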