Developing a MapReduce Application - Hadoop: The Definitive Guide

Incidentally, this explains why you have to set HADOOP_CLASSPATH to point to dependent classes and libraries if you are running with the local job runner without a job JAR (hadoop CLASSNAME).

The task classpath

On a cluster (and this includes pseudodistributed mode), map and reduce tasks run in separate JVMs, and their classpaths are not controlled by HADOOP_CLASSPATH. HADOOP_CLASSPATH is a client-side setting and only sets the classpath for the driver JVM, which submits the job.

Instead, the user's task classpath consists of the following:

▪ The job JAR file

▪ Any JAR files contained in the lib directory of the job JAR file, and the classes directory (if present)

▪ Any files added to the distributed cache using the -libjars option (see Table 6-1), or the addFileToClassPath() method on DistributedCache (old API) or Job (new API)

Packaging dependencies

Given these different ways of controlling what is on the client and task classpaths, there are corresponding options for including library dependencies for a job:

▪ Unpack the libraries and repackage them in the job JAR.

▪ Package the libraries in the lib directory of the job JAR.

▪ Keep the libraries separate from the job JAR, and add them to the client classpath via HADOOP_CLASSPATH and to the task classpath via -libjars.
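The third option can be sketched in shell. This is only a minimal sketch under stated assumptions: the helper name join_by, the library paths, the job JAR name job.jar, and the driver class MyDriver are all hypothetical. It relies on HADOOP_CLASSPATH being an ordinary colon-separated classpath, whereas -libjars takes a comma-separated list of JAR files. It also assumes the driver parses the generic options (for example, by running through ToolRunner), since that is what makes -libjars take effect. The script prints the resulting command rather than running it.

```shell
#!/bin/sh
# join_by SEP ITEM... (hypothetical helper): print ITEMs joined by SEP.
join_by() {
  sep=$1
  shift
  out=
  for item in "$@"; do
    if [ -z "$out" ]; then
      out=$item
    else
      out=$out$sep$item
    fi
  done
  printf '%s\n' "$out"
}

# Hypothetical library locations; substitute the JARs your job depends on.
set -- /opt/libs/a.jar /opt/libs/b.jar

# Client (driver JVM) classpath: colon-separated, like any Java classpath.
client_cp=$(join_by : "$@")

# Task classpath via the distributed cache: -libjars wants commas.
task_jars=$(join_by , "$@")

# Print the command instead of running it. Generic options such as
# -libjars go before the application's own arguments.
echo "HADOOP_CLASSPATH=$client_cp hadoop jar job.jar MyDriver -libjars $task_jars input output"
```

The same list of JARs is supplied twice because the two settings reach different JVMs: HADOOP_CLASSPATH affects only the driver, while -libjars places the files on the task classpath.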

The last option, using the distributed cache, is simplest from a build point of view because dependencies don't need rebundling in the job JAR. Also, using the distributed cache can mean fewer transfers of JAR files around the cluster, since files may be cached on a node between tasks. (You can read more about the distributed cache.)

Task classpath precedence

User JAR files are added to the end of both the client classpath and the task classpath, which in some cases can cause a dependency conflict with Hadoop's built-in libraries if Hadoop uses a different, incompatible version of a library that your code uses. Sometimes you need to be able to control the task classpath order so that your classes are picked up first. On the client side, you can force Hadoop to put the user classpath first in the search
