MapReduce Features - Hadoop: The Definitive Guide

Side Data Distribution

Side data can be defined as extra read-only data needed by a job to process the main dataset. The challenge is to make side data available to all the map or reduce tasks (which are spread across the cluster) in a convenient and efficient fashion.

Using the Job Configuration

You can set arbitrary key-value pairs in the job configuration using the various setter methods on Configuration (or JobConf in the old MapReduce API). This is very useful when you need to pass a small piece of metadata to your tasks.

In the task, you can retrieve the data from the configuration returned by Context's getConfiguration() method. (In the old API, it's a little more involved: override the configure() method in the Mapper or Reducer and use a getter method on the JobConf object passed in to retrieve the data. It's very common to store the data in an instance field so it can be used in the map() or reduce() method.)

Usually a primitive type is sufficient to encode your metadata, but for arbitrary objects you can either handle the serialization yourself (if you have an existing mechanism for turning objects to strings and back) or use Hadoop's Stringifier class. The DefaultStringifier uses Hadoop's serialization framework to serialize objects (see Serialization).
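As an illustration of the "handle the serialization yourself" route, here is a minimal sketch that turns an object into a string and back using only the JDK. It is not Hadoop's Stringifier and does not use Hadoop's serialization framework; the class name SideDataCodec and the choice of Java serialization plus Base64 are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Base64;

/** Hypothetical helper (not part of Hadoop): object to string and back. */
final class SideDataCodec {
    private SideDataCodec() {}

    /** Serializes the object with Java serialization, then Base64-encodes the bytes. */
    static String encode(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    /** Reverses encode(): Base64-decodes the string and deserializes the object. */
    static Object decode(String encoded) {
        byte[] raw = Base64.getDecoder().decode(encoded);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Class of encoded object not on classpath", e);
        }
    }
}
```

In a driver, the encoded string could then be stored with conf.set("myjob.sidedata", SideDataCodec.encode(obj)) and read back in the task with SideDataCodec.decode(context.getConfiguration().get("myjob.sidedata")); the key name myjob.sidedata is made up for this example. Base64 inflates the payload by roughly a third, which matters given how small configuration values should stay.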

You shouldn't use this mechanism for transferring more than a few kilobytes of data, because it can put pressure on the memory usage in MapReduce components. The job configuration is always read by the client, the application master, and the task JVM, and each time the configuration is read, all of its entries are read into memory, even if they are not used.
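One way to keep oversized values out of the job configuration is to check their size before setting them. The sketch below is only a hedged illustration: the class ConfigValueGuard and the 4 KB limit are assumptions chosen to match "a few kilobytes", not a threshold defined by Hadoop.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical guard (not part of Hadoop) for values headed into the job configuration. */
final class ConfigValueGuard {
    /** Illustrative limit only; the text says "a few kilobytes", not a precise figure. */
    static final int MAX_BYTES = 4 * 1024;

    private ConfigValueGuard() {}

    /** Returns the value unchanged if its UTF-8 encoding fits in MAX_BYTES, else throws. */
    static String requireSmall(String key, String value) {
        int size = value.getBytes(StandardCharsets.UTF_8).length;
        if (size > MAX_BYTES) {
            throw new IllegalArgumentException(
                "Value for '" + key + "' is " + size + " bytes; limit is " + MAX_BYTES);
        }
        return value;
    }
}
```

A driver might wrap each setter call, for example conf.set(key, ConfigValueGuard.requireSmall(key, value)), so that a value that has grown too large fails fast on the client rather than inflating every component that reads the configuration.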

Distributed Cache

Rather than serializing side data in the job configuration, it is preferable to distribute datasets using Hadoop's distributed cache mechanism. This provides a service for copying files and archives to the task nodes in time for the tasks to use them when they run. To save network bandwidth, files are normally copied to any particular node once per job.

Usage

For tools that use GenericOptionsParser (this includes many of the programs in this book; see GenericOptionsParser, Tool, and ToolRunner), you can specify the files to be distributed as a comma-separated list of URIs as the argument to the -files option. Files
