chapter 1 once more. If the job processes a document containing the word “the” 574
times, it's much more efficient to store and shuffle the pair (“the”, 574) once instead
of the pair (“the”, 1) multiple times. This processing step is known as combining. We
explain combiners in more depth in section 4.6.
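Section 4.6 covers combiners properly, but a minimal sketch shows the idea now (the class name here is ours, not one of this book's listings; it assumes the old org.apache.hadoop.mapred API used throughout this chapter). A combiner is simply a Reducer that Hadoop runs on each map task's local output before the shuffle, collapsing many ("the", 1) pairs into a single ("the", n) pair:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// A combiner is just a Reducer applied to each map task's local output,
// so a driver would register it with conf.setCombinerClass(...).
public class WordCountCombiner extends MapReduceBase
        implements Reducer<Text, LongWritable, Text, LongWritable> {

    public void reduce(Text key, Iterator<LongWritable> values,
                       OutputCollector<Text, LongWritable> output,
                       Reporter reporter) throws IOException {
        long sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();             // add up the local counts
        }
        output.collect(key, new LongWritable(sum)); // emit one pair per word
    }
}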
3.2.6 Word counting with predefined mapper and reducer classes
We have concluded our preliminary coverage of all the basic components of MapReduce.
Now that you've seen more classes provided by Hadoop, it'll be fun to revisit the WordCount example (see listing 3.3), using some of the classes we've learned.
Listing 3.3 Revised version of the WordCount example
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;
import org.apache.hadoop.mapred.lib.TokenCountMapper;

public class WordCount2 {
    public static void main(String[] args) {
        JobClient client = new JobClient();
        JobConf conf = new JobConf(WordCount2.class);
        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);
        conf.setMapperClass(TokenCountMapper.class);  // Hadoop's own TokenCountMapper
        conf.setCombinerClass(LongSumReducer.class);
        conf.setReducerClass(LongSumReducer.class);   // Hadoop's own LongSumReducer
        client.setConf(conf);
        try {
            JobClient.runJob(conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
We have to write only the driver for this MapReduce program, because we've used Hadoop's predefined TokenCountMapper and LongSumReducer classes (noted in the comments in listing 3.3). Easy, isn't it? Hadoop provides the ability to build more sophisticated programs (this will be the focus of part 2 of this book), but we want to emphasize that Hadoop allows you to rapidly generate useful programs with a minimal amount of code.
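To drive the point home, here's a hedged sketch of our own (not one of this book's listings): to count matches of a regular expression instead of words, you can swap in another of Hadoop's predefined classes, org.apache.hadoop.mapred.lib.RegexMapper, and set a single configuration property; everything else works as in listing 3.3. The class name DigitCount and the pattern are our illustrative choices.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;
import org.apache.hadoop.mapred.lib.RegexMapper;

// Hypothetical driver (name is ours): count runs of digits in the input.
public class DigitCount {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(DigitCount.class);
        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);
        conf.setMapperClass(RegexMapper.class);
        conf.set("mapred.mapper.regex", "[0-9]+");  // pattern whose matches we count
        conf.setCombinerClass(LongSumReducer.class);
        conf.setReducerClass(LongSumReducer.class);
        JobClient.runJob(conf);
    }
}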
3.3 Reading and writing
Let's look at how MapReduce reads input data and writes output data, focusing on the file formats it uses. To enable easy distributed processing, MapReduce makes certain assumptions about the data it's processing. It also provides flexibility in dealing with a variety of data formats.
Input data usually resides in large files, typically tens or hundreds of gigabytes or even more. One of the fundamental principles of MapReduce's processing power is the splitting of the input data into chunks. You can process these chunks in parallel using multiple machines. In Hadoop terminology these chunks are called input splits.
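As a rough sketch of our own (assuming the old mapred API used in this chapter, where the minimum split size is controlled by the mapred.min.split.size property; Hadoop still chooses the actual split boundaries), a driver can influence how coarse those splits are:

// Fragment of a driver such as listing 3.3 (sketch only): raising the
// minimum split size yields fewer, larger splits and thus fewer map tasks.
conf.setLong("mapred.min.split.size", 128L * 1024 * 1024); // 128 MB floor
// Each split is consumed by one map task, ideally on a machine that
// already stores that chunk of the file locally.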