Let's start with the development techniques applicable to Hadoop. Presumably
you're already familiar with standard Java software engineering techniques. We focus
on practices unique to data-centric programming within Hadoop.
6.1 Developing MapReduce programs
Chapter 2 discussed the three modes of Hadoop: local (standalone), pseudo-distributed,
and fully distributed. They correspond roughly to development, staging, and produc-
tion setups. Your development process will go through each of the three modes. You'll
have to be able to switch between configurations easily. In practice you may even have
more than one fully distributed cluster. Larger shops may, for example, have a “devel-
opment” cluster to further harden MapReduce programs before running them on the
real production cluster. You may have multiple clusters for different workloads. For
example, there can be an in-house cluster for running many small- to medium-sized
jobs and a cluster in the cloud that's more cost effective for running large, infre-
quent jobs.
Section 2.3 discussed how you can keep a different version of the hadoop-site.xml
configuration file for each setup and switch a symlink to point to the configuration
you want to work with at the moment. You can also specify the exact configuration
file for each Hadoop command with the -conf option. For example,
bin/hadoop fs -conf conf.cluster/hadoop-site.xml -lsr
will recursively list all the files in your fully distributed cluster, even though you may
currently be working in a different mode or on a different cluster (assuming
conf.cluster/hadoop-site.xml is your fully distributed cluster's configuration file).
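The symlink switching itself can be as simple as the following sketch. It assumes you keep one configuration directory per setup (say conf.standalone, conf.pseudo, and conf.cluster; the names are only illustrative, chosen to match the conf.cluster directory used above) and that conf is already a symlink rather than Hadoop's stock configuration directory:

rm -f conf                # remove the previous symlink (rm refuses to delete a real directory)
ln -s conf.cluster conf   # make the fully distributed configuration the default

Once conf points at conf.cluster, bin/hadoop commands pick up the fully distributed settings without needing an explicit -conf option.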
Before you run and test your Hadoop program, you'll need to make data available
for the configuration you're running. Section 3.1 describes various ways to get data
into and out of HDFS. For local and pseudo-distributed modes, you'll only want a
subset of your full data. Section 4.4 presents a Streaming program (RandomSample.py)
that can randomly sample a percentage of records from a data set in HDFS. As it's
a Python script, you can also use it to sample down a local file with a Unix pipe:
cat datafile | RandomSample.py 10
will give you a 10 percent sample of the file datafile.
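Because the script simply filters standard input, you can also combine it with the -conf trick above to pull a sample straight out of the fully distributed cluster into a local file in one step (the HDFS path here is only a placeholder):

bin/hadoop fs -conf conf.cluster/hadoop-site.xml -cat /path/to/datafile |
    RandomSample.py 10 > sample.txt

The resulting sample.txt can then be loaded into your pseudo-distributed HDFS using one of the methods from section 3.1, such as hadoop fs -put.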
Now that you have all the different configurations set up and know how to put data
into each configuration, let's look at how to develop and debug in local and pseudo-
distributed modes. The techniques build on top of each other as you get closer to the
production environment. We defer the discussion of debugging on the fully distributed
cluster until the next section.
6.1.1 Local mode
Hadoop in local mode runs everything within a single Java Virtual Machine (JVM)
and uses the local filesystem (i.e., no HDFS). Running within one JVM allows you to