Is the MapReduce framework fundamentally designed for implementing commutative
functions? Why or why not?
6 Multiplication (product)— Many machine-learning and statistical-classification
algorithms involve multiplying a large number of probability values. Usually we
compare the product of one set of probabilities to the product of a different set,
and choose the classification corresponding to the larger product. We've seen that
maximum is a distributive function. Is the product also distributive? Write a
MapReduce program that multiplies all the values in a data set. For full credit,
apply the program to a reasonably large data set. Does implementing the program
in MapReduce solve all scalability issues? If not, what should you do to fix them?
(Writing your own floating-point library is a popular answer, but not a good one.)
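As a starting point, here is a minimal sketch of the naive approach, assuming the newer org.apache.hadoop.mapreduce API and one numeric value per input line; the class and key names are ours, not a fixed part of any API:

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class Product {

    public static class MapClass
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        private static final Text PRODUCT = new Text("product");

        @Override
        public void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Emit every value under one shared key so a single
            // reduce call sees the entire data set.
            context.write(PRODUCT,
                new DoubleWritable(Double.parseDouble(line.toString().trim())));
        }
    }

    public static class Reduce
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        public void reduce(Text key, Iterable<DoubleWritable> values,
                Context context) throws IOException, InterruptedException {
            double product = 1.0;
            for (DoubleWritable v : values) {
                product *= v.get(); // multiplying many small probabilities
                                    // soon underflows a double to 0.0
            }
            context.write(key, new DoubleWritable(product));
        }
    }
}
```

Because multiplication is associative and commutative, Reduce could also serve as the job's combiner. The comment in the reduce loop hints at the scalability issue this exercise is really probing.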
7 Translation into fictional dialect— A popular assignment in introductory computer
science classes is to write a program that converts English to “pirate-speak.” Many
variations of the exercise exist for other semi-fictional dialects, such as “Snoop
Dogg” and “E-40.” Usually the solution involves a dictionary lookup for exact
word matches (“for” becomes “fo,” “sure” becomes “sho,” “the” becomes “da,”
etc.), simple text rules (words ending in “ing” now end in “in',” replace the last
vowel of a word and everything after it with “izzle,” etc.), and random injections
(“kno' wha' im sayin'?”). Write such a translator and use Hadoop to apply it to a
large corpus, such as Wikipedia.
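One possible shape for the solution, again only a sketch: a map-only job (the driver would call job.setNumReduceTasks(0)) whose mapper applies a toy dictionary and one of the text rules above. The class name, dictionary, and rule choice here are illustrative:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DialectMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {

    // Toy dictionary for exact word matches, from the exercise's examples
    private static final Map<String, String> DICT = new HashMap<>();
    static {
        DICT.put("for", "fo");
        DICT.put("sure", "sho");
        DICT.put("the", "da");
    }

    @Override
    public void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringBuilder out = new StringBuilder();
        for (String word : line.toString().split("\\s+")) {
            if (word.isEmpty()) continue;
            out.append(translate(word)).append(' ');
        }
        context.write(NullWritable.get(), new Text(out.toString().trim()));
    }

    private String translate(String word) {
        String lower = word.toLowerCase();
        if (DICT.containsKey(lower)) {          // exact-match lookup
            return DICT.get(lower);
        }
        if (lower.endsWith("ing")) {            // text rule: "ing" -> "in'"
            return lower.substring(0, lower.length() - 1) + "'";
        }
        return word;
    }
}
```

The remaining rules (the “izzle” rewrite and the random injections) slot into translate() the same way.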
4.8 Summary
MapReduce programs follow a template. Often the whole program is defined within
a single Java class. Within the class, a driver sets up a MapReduce job's configuration
object, which serves as the blueprint for how the job is set up and run. You'll find the
map and reduce functions in subclasses of Mapper and Reducer, respectively. Those
classes are often no more than a couple dozen lines long, so they're usually written as
inner classes for convenience.
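As a reminder of that shape, here is a skeletal instance of the template, a word count written against the newer org.apache.hadoop.mapreduce API (Job.getInstance is available in Hadoop 2 and later); the class names are illustrative:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Inner class holding the map function
    public static class MapClass
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        public void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String word : line.toString().split("\\s+")) {
                if (!word.isEmpty()) {
                    context.write(new Text(word), ONE);
                }
            }
        }
    }

    // Inner class holding the reduce function
    public static class Reduce
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // The driver: builds the job's "blueprint" and submits it
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(MapClass.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```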
Hadoop provides a Streaming API for writing MapReduce programs in languages
other than Java. Many MapReduce programs are much easier to develop in a scripting
language using the Streaming API, especially for ad hoc data analysis. The Aggregate
package, when used with Streaming, lets you rapidly write programs for counting and
computing basic statistics.
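A Streaming run over the Aggregate package might look roughly like the following; the streaming jar's path and name vary by Hadoop version, and aggMapper.py stands in for a hypothetical script that prints lines of the form LongValueSum:word<tab>1, while the special reducer name aggregate selects the Aggregate package:

```bash
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input input/ \
    -output output/ \
    -mapper 'aggMapper.py' \
    -reducer aggregate \
    -file aggMapper.py
```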
MapReduce programs are largely about the map and the reduce functions, but
Hadoop allows for a combiner function to improve performance by “pre-reducing”
the intermediate data at the mapper before the reduce phase.
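Continuing the word-count sketch above, enabling a combiner is a single driver line; reusing the reducer works here because its input and output types match:

```java
// Run the reducer as a combiner on each mapper's local output
job.setCombinerClass(Reduce.class);
```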
In standard programming (outside the MapReduce paradigm), counting,
summing, averaging, and so on are usually done in a simple, single pass over the
data. Refactoring those programs to run in MapReduce, as we've done in this chapter,
is conceptually straightforward. More complex data analysis algorithms call
for a deeper reworking of the algorithms, which we cover in the next chapter.
 