Chapter 6. Developing a MapReduce Application
In Chapter 2, we introduced the MapReduce model. In this chapter, we look at the practical
aspects of developing a MapReduce application in Hadoop.
Writing a program in MapReduce follows a certain pattern. You start by writing your map
and reduce functions, ideally with unit tests to make sure they do what you expect. Then
you write a driver program to run the job, which you can launch from your IDE against a
small subset of the data to check that it is working. If it fails, you can use your IDE's
debugger to find the source of the problem. With this information, you can expand your
unit tests to cover this case and improve your mapper or reducer as appropriate to handle
such input correctly.
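For example, a map function can be exercised in isolation with Apache MRUnit, which runs a mapper against known inputs and checks its outputs without a cluster. The following sketch assumes a hypothetical WordCountMapper that emits (word, 1) pairs; the MapDriver fluent API (withInput, withOutput, runTest) is MRUnit's standard usage:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class WordCountMapperTest {

  // Hypothetical mapper under test: emits (word, 1) for each word in a line.
  static class WordCountMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  @Test
  public void emitsOneCountPerWord() throws IOException {
    MapDriver.newMapDriver(new WordCountMapper())
        .withInput(new LongWritable(0), new Text("hello hello world"))
        .withOutput(new Text("hello"), new IntWritable(1))
        .withOutput(new Text("hello"), new IntWritable(1))
        .withOutput(new Text("world"), new IntWritable(1))
        .runTest();
  }
}

If runTest() sees an output the test did not declare, or misses one it did, the test fails with a message describing the mismatch, which is exactly the fast feedback loop this development pattern relies on.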
When the program runs as expected against the small dataset, you are ready to unleash it on
a cluster. Running against the full dataset is likely to expose some more issues, which you
can fix as before, by expanding your tests and altering your mapper or reducer to handle
the new cases. Debugging failing programs in the cluster is a challenge, so we'll look at
some common techniques to make it easier.
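One such technique is to make problems visible with counters rather than letting them fail tasks silently. The sketch below assumes tab-separated input; the DataQuality group name, the MalformedRecords counter name, and the parsing rule are invented for illustration, but getCounter(String, String) and increment() are standard methods available through the mapper's Context:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper that counts records it cannot parse instead of
// throwing, so bad input shows up in the job's counters.
public class ParseCheckingMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    if (line.split("\t").length < 2) {  // "malformed" rule is illustrative
      // Group and counter names are arbitrary strings chosen for this sketch.
      context.getCounter("DataQuality", "MalformedRecords").increment(1);
      return;                           // skip the bad record
    }
    context.write(value, NullWritable.get());
  }
}

Counter values are aggregated across all tasks and reported with the job's status, so a nonzero count points you at bad input without trawling through the logs of every task.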
After the program is working, you may wish to do some tuning, first by running through
some standard checks for making MapReduce programs faster and then by doing task
profiling. Profiling distributed programs is not easy, but Hadoop has hooks to aid in the
process.
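One of those hooks is configuration-driven profiling: you can ask the framework to profile a small, specified subset of tasks rather than the whole job. A minimal sketch, assuming the standard mapreduce.task.profile family of properties (the class itself and the task ranges chosen are illustrative):

import org.apache.hadoop.conf.Configuration;

// Minimal sketch: enable Hadoop's built-in task-profiling hooks via job
// configuration, limiting profiling to a few tasks to keep overhead low.
public class ProfilingConfigExample {
  public static Configuration profiledConf() {
    Configuration conf = new Configuration();
    conf.setBoolean("mapreduce.task.profile", true);  // turn profiling on
    conf.set("mapreduce.task.profile.maps", "0-1");   // map tasks 0 and 1 only
    conf.set("mapreduce.task.profile.reduces", "0");  // first reduce task only
    return conf;
  }
}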
Before we start writing a MapReduce program, however, we need to set up and configure
the development environment. And to do that, we need to learn a bit about how Hadoop
does configuration.
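As a preview, configuration revolves around the org.apache.hadoop.conf.Configuration class, which reads properties from XML resource files, with later resources overriding earlier ones. A minimal sketch, using invented resource and property names:

import org.apache.hadoop.conf.Configuration;

// Minimal sketch of Hadoop's Configuration API: properties come from XML
// resources, and resources added later override values from earlier ones.
public class ConfigurationExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.addResource("configuration-1.xml");  // hypothetical classpath resource
    conf.addResource("configuration-2.xml");  // overrides the first file

    // Property names here are invented for illustration.
    String color = conf.get("color");
    int size = conf.getInt("size", 0);        // 0 is the default if unset
    System.out.printf("color=%s, size=%d%n", color, size);
  }
}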