Let's start with the development techniques applicable to Hadoop. Presumably
you're already familiar with standard Java software engineering techniques. We focus
on practices unique to data-centric programming within Hadoop.
6.1 Developing MapReduce programs
Chapter 2 discussed the three modes of Hadoop: local (standalone), pseudo-distributed,
and fully distributed. They correspond roughly to development, staging, and produc-
tion setups. Your development process will go through each of the three modes. You'll
have to be able to switch between configurations easily. In practice you may even have
more than one fully distributed cluster. Larger shops may, for example, have a “devel-
opment” cluster to further harden MapReduce programs before running them on the
real production cluster. You may have multiple clusters for different workloads. For
example, there can be an in-house cluster for running many small- to medium-sized
jobs and a cluster in the cloud that's more cost effective for running large, infre-
quent jobs.
Section 2.3 discussed how you can keep a different version of the hadoop-site.xml
configuration file for each setup and switch a symlink to point to the configuration
you want to work with at the moment. You can also specify the exact configuration
file for each Hadoop command with the -conf option. For example,
bin/hadoop fs -conf conf.cluster/hadoop-site.xml -lsr
will recursively list all the files in your fully distributed cluster, even though you may
currently be working in a different mode or on a different cluster (assuming
conf.cluster/hadoop-site.xml is your fully distributed cluster's configuration file).
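The symlink switching itself can be as simple as the following sketch. It assumes you keep one configuration directory per setup (say conf.standalone, conf.pseudo, and conf.cluster; the names are only illustrative, chosen to match the conf.cluster directory used above) and that conf is already a symlink rather than Hadoop's stock configuration directory:

rm -f conf                # remove the previous symlink (rm refuses to delete a real directory)
ln -s conf.cluster conf   # make the fully distributed configuration the default

Once conf points at conf.cluster, bin/hadoop commands pick up the fully distributed settings without needing an explicit -conf option.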
Before you run and test your Hadoop program, you'll need to make data available
for the configuration you're running. Section 3.1 describes various ways to get data
into and out of HDFS. For local and pseudo-distributed modes, you'll only want a
subset of your full data. Section 4.4 presents a Streaming program (RandomSample.py)
that can randomly sample a percentage of records from a data set in HDFS. As it's
a Python script, you can also use it to sample down a local file with a Unix pipe:
cat datafile | RandomSample.py 10
will give you a 10 percent sample of the file datafile.
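Because the script simply filters standard input, you can also combine it with the -conf trick above to pull a sample straight out of the fully distributed cluster into a local file in one step (the HDFS path here is only a placeholder):

bin/hadoop fs -conf conf.cluster/hadoop-site.xml -cat /path/to/datafile |
    RandomSample.py 10 > sample.txt

The resulting sample.txt can then be loaded into your pseudo-distributed HDFS using one of the methods from section 3.1, such as hadoop fs -put.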
Now that you have all the different configurations set up and know how to put data
into each configuration, let's look at how to develop and debug in local and pseudo-
distributed modes. The techniques build on top of each other as you get closer to the
production environment. We defer the discussion of debugging on the fully distributed
cluster until the next section.
6.1.1 Local mode
Hadoop in local mode runs everything within a single Java Virtual Machine (JVM)
and uses the local filesystem (i.e., no HDFS). Running within one JVM allows you to