Is the MapReduce framework fundamentally designed for implementing commutative
functions? Why or why not?
6 Multiplication (product)— Many machine-learning and statistical-classification
algorithms involve multiplying a large number of probability values. Usually we
compare the product of one set of probabilities to the product of a different set,
and choose the classification corresponding to the larger product. We've seen that
maximum is a distributive function. Is the product also distributive? Write a
MapReduce program that multiplies all the values in a data set. For full credit,
apply the program to a reasonably large data set. Does implementing the program
in MapReduce solve all scalability issues? If not, what should you do to fix them?
(Writing your own floating-point library is a popular answer, but not a good one.)
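As a starting point, here is a minimal sketch of the naive approach, assuming the newer org.apache.hadoop.mapreduce API and one numeric value per input line; the class and key names are ours, not a fixed part of any API:

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class Product {

    public static class MapClass
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        private static final Text PRODUCT = new Text("product");

        @Override
        public void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Emit every value under one shared key so a single
            // reduce call sees the entire data set.
            context.write(PRODUCT,
                new DoubleWritable(Double.parseDouble(line.toString().trim())));
        }
    }

    public static class Reduce
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        public void reduce(Text key, Iterable<DoubleWritable> values,
                Context context) throws IOException, InterruptedException {
            double product = 1.0;
            for (DoubleWritable v : values) {
                product *= v.get(); // multiplying many small probabilities
                                    // soon underflows a double to 0.0
            }
            context.write(key, new DoubleWritable(product));
        }
    }
}
```

Because multiplication is associative and commutative, Reduce could also serve as the job's combiner. The comment in the reduce loop hints at the scalability issue this exercise is really probing.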
7 Translation into fictional dialect— A popular assignment in introductory computer
science classes is to write a program that converts English to “pirate-speak.” Many
variations of the exercise exist for other semi-fictional dialects, such as “Snoop
Dogg” and “E-40.” Usually the solution involves a dictionary lookup for exact
word matches (“for” becomes “fo,” “sure” becomes “sho,” “the” becomes “da,”
etc.), simple text rules (words ending in “ing” now end in “in',” replace the last
vowel of a word and everything after it with “izzle,” etc.), and random injections
(“kno' wha' im sayin'?”). Write such a translator and use Hadoop to apply it to a
large corpus, such as Wikipedia.
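One possible shape for the solution, again only a sketch: a map-only job (the driver would call job.setNumReduceTasks(0)) whose mapper applies a toy dictionary and one of the text rules above. The class name, dictionary, and rule choice here are illustrative:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DialectMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {

    // Toy dictionary for exact word matches, from the exercise's examples
    private static final Map<String, String> DICT = new HashMap<>();
    static {
        DICT.put("for", "fo");
        DICT.put("sure", "sho");
        DICT.put("the", "da");
    }

    @Override
    public void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringBuilder out = new StringBuilder();
        for (String word : line.toString().split("\\s+")) {
            if (word.isEmpty()) continue;
            out.append(translate(word)).append(' ');
        }
        context.write(NullWritable.get(), new Text(out.toString().trim()));
    }

    private String translate(String word) {
        String lower = word.toLowerCase();
        if (DICT.containsKey(lower)) {          // exact-match lookup
            return DICT.get(lower);
        }
        if (lower.endsWith("ing")) {            // text rule: "ing" -> "in'"
            return lower.substring(0, lower.length() - 1) + "'";
        }
        return word;
    }
}
```

The remaining rules (the “izzle” rewrite and the random injections) slot into translate() the same way.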
4.8 Summary
MapReduce programs follow a template. Often the whole program is defined within
a single Java class. Within the class, a driver sets up a MapReduce job's configuration
object, which serves as the blueprint for how the job is set up and run. You'll find the
map and reduce functions in subclasses of Mapper and Reducer, respectively. Those
classes are often no more than a couple dozen lines long, so they're usually written as
inner classes for convenience.
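As a reminder of that shape, here is a skeletal instance of the template, a word count written against the newer org.apache.hadoop.mapreduce API (Job.getInstance is available in Hadoop 2 and later); the class names are illustrative:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Inner class holding the map function
    public static class MapClass
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        public void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String word : line.toString().split("\\s+")) {
                if (!word.isEmpty()) {
                    context.write(new Text(word), ONE);
                }
            }
        }
    }

    // Inner class holding the reduce function
    public static class Reduce
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // The driver: builds the job's "blueprint" and submits it
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(MapClass.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```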
Hadoop provides a Streaming API for writing MapReduce programs in languages
other than Java. Many MapReduce programs are much easier to develop in a scripting
language using the Streaming API, especially for ad hoc data analysis. The Aggregate
package, when used with Streaming, lets you rapidly write programs for counting and
computing basic statistics.
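A Streaming run over the Aggregate package might look roughly like the following; the streaming jar's path and name vary by Hadoop version, and aggMapper.py stands in for a hypothetical script that prints lines of the form LongValueSum:word<tab>1, while the special reducer name aggregate selects the Aggregate package:

```bash
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input input/ \
    -output output/ \
    -mapper 'aggMapper.py' \
    -reducer aggregate \
    -file aggMapper.py
```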
MapReduce programs are largely about the map and the reduce functions, but
Hadoop allows for a combiner function to improve performance by “pre-reducing”
the intermediate data at the mapper before the reduce phase.
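Continuing the word-count sketch above, enabling a combiner is a single driver line; reusing the reducer works here because its input and output types match:

```java
// Run the reducer as a combiner on each mapper's local output
job.setCombinerClass(Reduce.class);
```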
In standard programming (outside the MapReduce paradigm), counting,
summing, averaging, and so on are usually done in a simple, single pass over the
data. Refactoring those programs to run in MapReduce, as we've done in this chapter,
is conceptually straightforward. More complex data analysis algorithms call
for a deeper reworking of the algorithms, which we cover in the next chapter.
 