148 else
149 {
150 other_args.add(args[i]);
151 }
152 }
153
154 FileInputFormat.setInputPaths(conf, new Path(other_args.get(0)));
155 FileOutputFormat.setOutputPath(conf, new Path(other_args.get(1)));
156
157 JobClient.runJob(conf);
158 return 0;
159 }
160 /*------------------------------------------------------------------*/
161 public static void main(String[] args) throws Exception
162 {
163 int res = ToolRunner.run(new Configuration(), new WordCount(), args);
164 System.exit(res);
165 }
166
167 } /* class word count*/
Describing the Example 2 Code
Take a closer look at the code for this second example, which builds on the simpler example given earlier. As before, line 1 defines the package name as org.myorg, and lines 6 through 11 import the Hadoop functionality for Path, configuration, I/O, MapReduce, and utilities.
New to this second example is the distributed-cache import, which is used to ship the job configuration's pattern file (described later) to the task nodes:
07 import org.apache.hadoop.filecache.DistributedCache;
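Neither side of the cache hand-off appears in the listing fragment above, so the sketch below (not part of the book's numbered listing; the `-skip` option handling and property name are assumptions) shows how the two halves typically connect under the old org.apache.hadoop.mapred API: the driver registers the file in the JobConf, and the mapper's configure method pulls down the local copies.

```java
// Driver side (inside run()), assuming a "-skip <file>" argument was seen:
// record the pattern file's URI in the JobConf so the framework copies the
// file to every task node before the map tasks start.
DistributedCache.addCacheFile(new Path(args[++i]).toUri(), conf);
conf.setBoolean("wordcount.skip.patterns", true);

// Mapper side (inside configure(JobConf job)): fetch the node-local paths
// of the cached files and hand each one to parseSkipFile().
Path[] patternsFiles = DistributedCache.getLocalCacheFiles(job);
for (Path patternsFile : patternsFiles) {
    parseSkipFile(patternsFile);
}
```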
Line 13 defines the main WordCount class:
13 public class WordCount extends Configured implements Tool
Meanwhile, the Map class is defined at line 17:
17 public static class Map extends MapReduceBase
18 implements Mapper < LongWritable, Text, Text, IntWritable >
This class now has a configure method defined at line 36, which offers case-sensitivity and pattern-skipping
functionality:
36 public void configure(JobConf job)
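The per-line preprocessing that these options enable can be illustrated outside Hadoop. The helper below is a hypothetical, self-contained analogue (not the book's code): it folds case when case sensitivity is off, then deletes every occurrence of each skip pattern before the line would be tokenized into words.

```java
import java.util.Set;
import java.util.regex.Pattern;

public class LineCleaner {
    // Mirror of the mapper's per-line preprocessing: optionally lowercase,
    // then remove every occurrence of each skip pattern from the line.
    static String clean(String line, boolean caseSensitive, Set<String> patterns) {
        String result = caseSensitive ? line : line.toLowerCase();
        for (String pattern : patterns) {
            // Pattern.quote() treats the skip entry as a literal string,
            // so punctuation such as "," or "!" needs no escaping here.
            result = result.replaceAll(Pattern.quote(pattern), "");
        }
        return result;
    }

    public static void main(String[] args) {
        // With case sensitivity off, "Hello, World!" becomes "hello world".
        System.out.println(clean("Hello, World!", false, Set.of(",", "!")));
    }
}
```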
The parseSkipFile method at line 60 parses the pattern file for the pattern-skipping functionality just
mentioned. The patternsFile contains a list of patterns that should be removed from the text to be processed when
counting words:
60 private void parseSkipFile(Path patternsFile)
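Its job is simple: read the cached file line by line and collect one pattern per line. A plain-java.io analogue (Hadoop's FileSystem reader replaced by java.nio; class and variable names are illustrative) looks like this:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

public class SkipFileParser {
    // Reads the pattern file one line at a time, accumulating each line
    // into a set, just as the mapper fills its patterns-to-skip collection.
    static Set<String> parseSkipFile(Path patternsFile) throws IOException {
        Set<String> patternsToSkip = new HashSet<>();
        try (BufferedReader reader = Files.newBufferedReader(patternsFile)) {
            String pattern;
            while ((pattern = reader.readLine()) != null) {
                patternsToSkip.add(pattern);
            }
        }
        return patternsToSkip;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway pattern file and parse it back.
        Path tmp = Files.createTempFile("patterns", ".txt");
        Files.write(tmp, java.util.List.of("\\.", "\\,", "to"));
        System.out.println(parseSkipFile(tmp));
    }
}
```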
 