Chapter 4
Processing Data with Map Reduce
Hadoop Map Reduce is a system for the parallel processing of very large data sets, using distributed, fault-tolerant storage spread over very large clusters. The input data set is broken down into pieces (whose size is configurable), and each piece becomes the input to a Map function running on a Hadoop cluster data node. The Map functions filter and transform their chunks of data, emitting key-value pairs; this output is then shuffled and sorted by key and delivered to the Reduce processes, which summarize the grouped data to produce the resulting output.
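As a rough, single-process illustration of this flow (plain Java, not Hadoop code; the class name and sample input are invented for this sketch), the following shows words being mapped to (word, 1) pairs, grouped and sorted by key, and then summed per key:

import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical demo class: mimics the Map -> shuffle/sort -> Reduce data flow
// described above within a single JVM, using two input "chunks".
public class WordCountFlowDemo {
    public static void main(String[] args) {
        List<String> chunks = Arrays.asList("one fish two fish", "red fish blue fish");

        Map<String, Long> counts = chunks.stream()
            // Map phase: split each chunk and emit a (word, 1) pair per word
            .flatMap(chunk -> Arrays.stream(chunk.split("\\s+"))
                                    .map(word -> new SimpleEntry<>(word, 1L)))
            // Shuffle/sort plus Reduce phase: group the pairs by word and sum
            // the 1s; a TreeMap keeps the keys sorted, as Reduce input would be
            .collect(Collectors.groupingBy(SimpleEntry::getKey,
                                           TreeMap::new,
                                           Collectors.summingLong(SimpleEntry::getValue)));

        counts.forEach((word, total) -> System.out.println(word + "\t" + total));
    }
}

In a real cluster the grouping step is performed by the Hadoop framework across many nodes; the point here is only the shape of the data as it moves between the phases.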
This chapter explores Map Reduce programming through multiple implementations of a simple but flexible word-count algorithm. After coding the word-count algorithm in raw Java using classes provided by the Hadoop libraries, you will learn to carry out the same word-count function in Pig Latin, Hive, and Perl. Don't worry about these terms yet; you will learn to source, install, and work with this software in the coming sections.
An Overview of the Word-Count Algorithm
The best way to understand the Map Reduce system is with an example. A word-count algorithm is not only the simplest and most common Map Reduce example, but it also contains techniques you can apply to more complex scenarios. To begin, consider Figure 4-1, which breaks the word-count process into steps.
Figure 4-1. The word-count Map Reduce process
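As a preview of the raw Java implementation covered later in this chapter, the following is a minimal sketch of a word-count Mapper and Reducer written against the standard org.apache.hadoop.mapreduce API. It follows the stock Hadoop word-count example; the class names TokenizerMapper and IntSumReducer are taken from that example, not from this chapter's code.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map phase: for every word in the input line, emit the pair (word, 1)
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: after the shuffle/sort, sum the counts for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}

A driver class would then configure a Job with these Mapper and Reducer classes, set the input and output paths, and submit the job to the cluster; that wiring is shown when the full Java implementation is developed later in the chapter.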