The input file on the left of Figure 4-1 is split into chunks of data. The size of these splits is controlled by the
FileInputFormat class of the Map Reduce job, which divides the input into InputSplit units. The number of splits is
influenced by the HDFS block size, and one Mapper task is created for each split data chunk.
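Under a simplified model (assumed here: the split size equals the HDFS block size, ignoring the minimum and maximum split-size settings that FileInputFormat also honors), the number of splits, and hence the number of map tasks, can be estimated with ceiling division:

```java
public class SplitEstimate {

    // Simplified model: one split per HDFS block.
    // The real FileInputFormat calculation also considers the configured
    // minimum and maximum split sizes, so treat this as an estimate only.
    static long numSplits(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes; // ceiling division
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // Hadoop 1.x default block size: 64 MB
        // A 200 MB file spans four 64 MB blocks, so roughly four map tasks.
        System.out.println(numSplits(200L * 1024 * 1024, blockSize)); // prints 4
    }
}
```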
Each split data chunk (for example, “one one two three”) is sent to a Mapper process on a Data Node server. In the
word-count example, the Map process creates a series of key-value pairs in which the key is a word (for instance,
“one”) and the value is the count 1. These key-value pairs are then shuffled into lists grouped by key. The shuffled
lists are input to the Reduce tasks, which reduce the list volume by summing the values for each key (in this simple
example). The Reduce output is then a simple list of summed key-value pairs.
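The map, shuffle, and reduce stages described above can be simulated in plain Java without Hadoop. The sketch below is illustrative only (it is not the Hadoop API); it applies the three stages to the sample input “one one two three”:

```java
import java.util.*;

public class WordCountSketch {

    // Map stage: emit a (word, 1) pair for every word in the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String w : line.trim().split("\\s+")) {
            pairs.add(new AbstractMap.SimpleEntry<>(w, 1));
        }
        return pairs;
    }

    // Shuffle stage: group the emitted values into one list per key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce stage: sum each key's list of values.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(reduce(shuffle(map("one one two three"))));
        // prints {one=2, three=1, two=1}
    }
}
```

In the real framework, the shuffle is performed between cluster nodes by Hadoop itself; here it is a single in-memory grouping step.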
The Map Reduce framework takes care of all other tasks, such as task scheduling and resource management.
Map Reduce Native
Now, it's time to put the word-count algorithm to work, starting with the most basic example: the Map Reduce native
version. The term Map Reduce native means that the Map Reduce code is written in Java using the functionality
provided by the Hadoop core libraries within the Hadoop installation directory. Map Reduce native word-count
algorithms are available from the Apache Software Foundation Hadoop 1.2.1 Map Reduce tutorial on the Hadoop
website (hadoop.apache.org/docs/r1.2.1/). For instance, I sourced two versions of the Hadoop word-count
algorithm and stored them in Java files in a word-count directory, as follows:
[hadoop@hc1nn wordcount]$ pwd
/usr/local/hadoop/wordcount
[hadoop@hc1nn wordcount]$ ls *.java
wc-ex1.java wc-ex2.java
The file wc-ex1.java contains the first, simple example; the file wc-ex2.java contains a second, more complex
version.
Java Word-Count Example 1
Consider the Java code for the first word-count example. It follows the basic word-count steps shown in Figure 4-1:
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount
{

  public static class Map extends MapReduceBase implements
      Mapper<LongWritable, Text, Text, IntWritable>
  {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
