Incidentally, this explains why you have to set HADOOP_CLASSPATH to point to dependent
classes and libraries if you are running using the local job runner without a job JAR
(hadoop CLASSNAME).
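As a minimal sketch of setting that variable for a local run, the helper below appends entries to HADOOP_CLASSPATH one at a time. The function name add_to_hadoop_classpath, the paths target/classes and lib/example-dependency.jar, and the driver name MyDriver are all hypothetical placeholders, not names from the text.

```shell
#!/bin/sh
# Append one entry to HADOOP_CLASSPATH, inserting a colon only when the
# variable already holds something, so no stray leading colon is produced.
add_to_hadoop_classpath() {
  HADOOP_CLASSPATH="${HADOOP_CLASSPATH:+$HADOOP_CLASSPATH:}$1"
  export HADOOP_CLASSPATH
}

# Hypothetical project layout: compiled classes plus one third-party JAR.
add_to_hadoop_classpath target/classes
add_to_hadoop_classpath lib/example-dependency.jar

# With the variable exported, the driver could then be launched by class
# name rather than from a job JAR, for example:
#   hadoop MyDriver input output     (MyDriver is a placeholder)
```

Building the value incrementally keeps any entries that were already present in the environment, which may matter if other tools also rely on the variable.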
The task classpath
On a cluster (and this includes pseudodistributed mode), map and reduce tasks run in separate
JVMs, and their classpaths are not controlled by HADOOP_CLASSPATH.
HADOOP_CLASSPATH is a client-side setting and only sets the classpath for the driver
JVM, which submits the job.
Instead, the user's task classpath consists of the following:
▪ The job JAR file
▪ Any JAR files contained in the lib directory of the job JAR file, and the classes
directory (if present)
▪ Any files added to the distributed cache using the -libjars option (see
Table 6-1), or the addFileToClassPath() method on DistributedCache
(old API), or Job (new API)
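One rough way to check what a job JAR itself would contribute is to filter its entry listing for the lib and classes locations named in the list above. The function name task_classpath_entries is hypothetical, and matching only JARs directly under lib/ is an assumption of this sketch. It reads a listing on standard input, such as the output of the JDK's jar tf command, so it does not cover files supplied separately through -libjars.

```shell
#!/bin/sh
# Read a JAR entry listing (one entry per line) on stdin and print only the
# entries this sketch treats as task-classpath contributions from inside the
# job JAR: JAR files directly under lib/ and the classes/ directory entry.
task_classpath_entries() {
  grep -E '^(lib/[^/]+\.jar|classes/)$'
}

# Example usage (job.jar is a placeholder file name):
#   jar tf job.jar | task_classpath_entries
```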
Packaging dependencies
Given these different ways of controlling what is on the client and task classpaths, there
are corresponding options for including library dependencies for a job:
▪ Unpack the libraries and repackage them in the job JAR.
▪ Package the libraries in the lib directory of the job JAR.
▪ Keep the libraries separate from the job JAR, and add them to the client classpath
via HADOOP_CLASSPATH and to the task classpath via -libjars.
The last option, using the distributed cache, is simplest from a build point of view because
dependencies don't need rebundling in the job JAR. Also, using the distributed cache can
mean fewer transfers of JAR files around the cluster, since files may be cached on a node
between tasks. (You can read more about the distributed cache.)
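The third option needs the same set of JARs expressed two ways: colon-separated for HADOOP_CLASSPATH (on Unix-like systems) and comma-separated for -libjars. The sketch below derives both from one directory and prints the resulting commands instead of running them. The function names, the directory argument, job.jar, MyDriver, and the input and output arguments are hypothetical placeholders. Note also that -libjars only takes effect when the driver parses generic options, for example by running through ToolRunner.

```shell
#!/bin/sh
# Join the paths of all JARs in a directory using the given separator.
join_jars() {
  dir=$1
  sep=$2
  out=""
  for jar in "$dir"/*.jar; do
    [ -e "$jar" ] || continue          # skip the literal glob when no JARs exist
    out="${out:+$out$sep}$jar"
  done
  printf '%s\n' "$out"
}

# Print (rather than execute) the client-side setting and the job submission
# command for dependencies kept outside the job JAR.
print_job_command() {
  deps_dir=$1
  job_jar=$2
  driver=$3
  printf 'export HADOOP_CLASSPATH=%s\n' "$(join_jars "$deps_dir" :)"
  printf 'hadoop jar %s %s -libjars %s input output\n' \
    "$job_jar" "$driver" "$(join_jars "$deps_dir" ,)"
}

# Example usage (all arguments are placeholders):
#   print_job_command deps job.jar MyDriver
```

Printing the commands first makes it easy to inspect the two separator styles before submitting anything to a cluster.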
Task classpath precedence
User JAR files are added to the end of both the client classpath and the task classpath,
which in some cases can cause a dependency conflict with Hadoop's built-in libraries if
Hadoop uses a different, incompatible version of a library that your code uses. Sometimes
you need to be able to control the task classpath order so that your classes are picked up
first. On the client side, you can force Hadoop to put the user classpath first in the search