Running on a Cluster
Now that we are happy with the program running on a small test dataset, we are ready to
try it on the full dataset on a Hadoop cluster. Chapter 10 covers how to set up a fully
distributed cluster, although you can also work through this section on a
pseudo-distributed cluster.
Packaging a Job
The local job runner uses a single JVM to run a job, so as long as all the classes that your
job needs are on its classpath, things will just work.
In a distributed setting, things are a little more complex. For a start, a job's classes must be
packaged into a job JAR file to send to the cluster. Hadoop will find the job JAR
automatically by searching for the JAR on the driver's classpath that contains the class set in
the setJarByClass() method (on JobConf or Job). Alternatively, if you want to set an
explicit JAR file by its file path, you can use the setJar() method. (The JAR file path
may be local or an HDFS file path.)
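To make the lookup behind setJarByClass() more concrete, here is a minimal sketch of the general idea: ask a class's classloader where its .class file came from, and if the answer is a jar: URL, take the JAR path from it. This uses only the JDK and is not Hadoop's own code; the class name JarLocator, its method names, and the com/example/MyJob path are hypothetical, chosen for illustration.

```java
import java.net.URL;

public class JarLocator {

    // Extracts the JAR path from a URL string of the form
    // "jar:file:/tmp/job.jar!/com/example/MyJob.class".
    // Returns null for any other form (for example, a plain "file:" URL
    // pointing into a directory of compiled classes).
    // Note: the path is returned as-is, without URL-decoding.
    public static String parseJarUrl(String url) {
        String prefix = "jar:file:";
        int bang = url.indexOf('!');
        if (!url.startsWith(prefix) || bang < 0) {
            return null;
        }
        return url.substring(prefix.length(), bang);
    }

    // Returns the path of the JAR that the given class was loaded from,
    // or null if the class did not come from a JAR on the classpath.
    public static String findContainingJar(Class<?> clazz) {
        ClassLoader loader = clazz.getClassLoader();
        if (loader == null) {
            return null; // classes loaded by the bootstrap loader, such as String
        }
        String resource = clazz.getName().replace('.', '/') + ".class";
        URL url = loader.getResource(resource);
        return url == null ? null : parseJarUrl(url.toString());
    }

    public static void main(String[] args) {
        // Prints a JAR path when run from a JAR, or "null" when run
        // from a directory of compiled classes.
        System.out.println(findContainingJar(JarLocator.class));
    }
}
```

One consequence of this kind of lookup is that a driver run from an unpackaged classes directory has no JAR to find, which is a case where setting the path explicitly with setJar() may be the more practical choice.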
Creating a job JAR file is conveniently achieved using a build tool such as Ant or Maven.
Given the POM in Example 6-3, the following Maven command will create a JAR file called
hadoop-examples.jar in the project directory containing all of the compiled classes:
% mvn package -DskipTests
If you have a single job per JAR, you can specify the main class to run in the JAR file's
manifest. If the main class is not in the manifest, it must be specified on the command line
(as we will see shortly when we run the job).
Any dependent JAR files can be packaged in a lib subdirectory in the job JAR file,
although there are other ways to include dependencies, discussed later. Similarly, resource
files can be packaged in a classes subdirectory. (This is analogous to a Java Web
application archive, or WAR, file, except in that case the JAR files go in a WEB-INF/lib
subdirectory and classes go in a WEB-INF/classes subdirectory in the WAR file.)
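The layout described above can be sketched with the JDK's java.util.jar classes. In practice a build tool produces the JAR; this example only shows where each piece ends up: the main class in the manifest, a dependency under lib, and a resource under classes. The class name JobJarLayout and the entry names (com/example/MyJob.class, lib/some-dependency.jar, classes/job.properties) are hypothetical placeholders, and the entries are written empty.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.Attributes;
import java.util.jar.JarEntry;
import java.util.jar.JarInputStream;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

public class JobJarLayout {

    // Builds an in-memory JAR whose manifest names the main class and whose
    // entries follow the job JAR layout: classes at the root, dependencies
    // under lib/, resources under classes/.
    public static byte[] buildJobJar(String mainClass) {
        Manifest manifest = new Manifest();
        Attributes attrs = manifest.getMainAttributes();
        // Without a Manifest-Version attribute the manifest is written empty.
        attrs.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        attrs.put(Attributes.Name.MAIN_CLASS, mainClass);

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (JarOutputStream jar = new JarOutputStream(bytes, manifest)) {
            addEmptyEntry(jar, "com/example/MyJob.class");   // placeholder job class
            addEmptyEntry(jar, "lib/some-dependency.jar");   // placeholder dependency
            addEmptyEntry(jar, "classes/job.properties");    // placeholder resource
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    private static void addEmptyEntry(JarOutputStream jar, String name) throws IOException {
        jar.putNextEntry(new JarEntry(name));
        jar.closeEntry();
    }

    // Lists the entry names in the JAR (the manifest itself is not included).
    public static List<String> listEntries(byte[] jarBytes) {
        List<String> names = new ArrayList<>();
        try (JarInputStream in = new JarInputStream(new ByteArrayInputStream(jarBytes))) {
            JarEntry entry;
            while ((entry = in.getNextJarEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    // Reads the Main-Class attribute back from the JAR's manifest.
    public static String mainClassOf(byte[] jarBytes) {
        try (JarInputStream in = new JarInputStream(new ByteArrayInputStream(jarBytes))) {
            return in.getManifest().getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] jar = buildJobJar("com.example.MyJob");
        System.out.println("Main-Class: " + mainClassOf(jar));
        listEntries(jar).forEach(System.out::println);
    }
}
```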
The client classpath
The user's client-side classpath set by hadoop jar <jar> is made up of:
▪ The job JAR file
▪ Any JAR files in the lib directory of the job JAR file, and the classes directory (if
present)
▪ The classpath defined by HADOOP_CLASSPATH, if set
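As a rough illustration of the list above, the following sketch assembles the same three kinds of entries into one list. It is a model for reasoning about what ends up on the client classpath, not a reproduction of what the hadoop script does internally. The class name ClientClasspathSketch, the method compose(), and the "jar!/path" labels are assumptions made for this example, and the ordering simply mirrors the list above.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ClientClasspathSketch {

    // Returns descriptive labels for the client classpath entries:
    // the job JAR, each JAR in its lib directory, its classes directory
    // (if present), and each entry of HADOOP_CLASSPATH (if set).
    public static List<String> compose(String jobJar, List<String> libJars,
                                       boolean hasClassesDir, String hadoopClasspath) {
        List<String> entries = new ArrayList<>();
        entries.add(jobJar);
        for (String lib : libJars) {
            // "jar!/path" is only a label for "this path inside that JAR".
            entries.add(jobJar + "!/lib/" + lib);
        }
        if (hasClassesDir) {
            entries.add(jobJar + "!/classes/");
        }
        if (hadoopClasspath != null && !hadoopClasspath.isEmpty()) {
            // HADOOP_CLASSPATH uses the platform's path separator.
            entries.addAll(Arrays.asList(hadoopClasspath.split(File.pathSeparator)));
        }
        return entries;
    }

    public static void main(String[] args) {
        List<String> classpath = compose(
                "hadoop-examples.jar",
                Arrays.asList("some-dependency.jar"), // hypothetical dependency
                true,
                System.getenv("HADOOP_CLASSPATH"));
        classpath.forEach(System.out::println);
    }
}
```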