Database Reference
In-Depth Information
Side Data Distribution
Side data can be defined as extra read-only data needed by a job to process the main data-
set. The challenge is to make side data available to all the map or reduce tasks (which are
spread across the cluster) in a convenient and efficient fashion.
Using the Job Configuration
You can set arbitrary key-value pairs in the job configuration using the various setter meth-
ods on Configuration (or JobConf in the old MapReduce API). This is very useful
when you need to pass a small piece of metadata to your tasks.
In the task, you can retrieve the data from the configuration returned by Context 's
getConfiguration() method. (In the old API, it's a little more involved: override the
configure() method in the Mapper or Reducer and use a getter method on the
JobConf object passed in to retrieve the data. It's very common to store the data in an in-
stance field so it can be used in the map() or reduce() method.)
Usually a primitive type is sufficient to encode your metadata, but for arbitrary objects you
can either handle the serialization yourself (if you have an existing mechanism for turning
objects to strings and back) or use Hadoop's Stringifier class. The De-
faultStringifier uses Hadoop's serialization framework to serialize objects (see
Serialization ) .
You shouldn't use this mechanism for transferring more than a few kilobytes of data, be-
cause it can put pressure on the memory usage in MapReduce components. The job config-
uration is always read by the client, the application master, and the task JVM, and each
time the configuration is read, all of its entries are read into memory, even if they are not
used.
Distributed Cache
Rather than serializing side data in the job configuration, it is preferable to distribute data-
sets using Hadoop's distributed cache mechanism. This provides a service for copying files
and archives to the task nodes in time for the tasks to use them when they run. To save net-
work bandwidth, files are normally copied to any particular node once per job.
Usage
For tools that use GenericOptionsParser (this includes many of the programs in this
book; see GenericOptionsParser, Tool, and ToolRunner ), you can specify the files to be dis-
tributed as a comma-separated list of URIs as the argument to the -files option. Files
Search WWH ::




Custom Search