The DataNode will use them all in parallel to speed up I/O.² You should also specify directories on multiple drives for mapred.local.dir to speed up the processing of temporary data.
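As a sketch of the relevant configuration files (the /disk1 and /disk2 mount points here are placeholders for your own drives), multiple directories are given as a comma-separated list:

   <!-- hdfs-site.xml: one HDFS data directory per physical drive -->
   <property>
     <name>dfs.data.dir</name>
     <value>/disk1/hdfs/data,/disk2/hdfs/data</value>
   </property>

   <!-- mapred-site.xml: spread temporary MapReduce data across the same drives -->
   <property>
     <name>mapred.local.dir</name>
     <value>/disk1/mapred/local,/disk2/mapred/local</value>
   </property>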
The default value of Hadoop's temporary directory, hadoop.tmp.dir, depends on the user name. You should avoid having any Hadoop property depend on a user name, as there can be mismatches between the user name used to submit a job and the user name used to start a Hadoop node. You should set it to something like /home/hadoop/tmp to be independent of any user name. Another problem with the default value of hadoop.tmp.dir is that it points to the /tmp directory. Although that's an appropriate place for temporary storage, most default Linux configurations have a quota on /tmp that is too small for Hadoop. Rather than increase the quota for /tmp, it's better to point hadoop.tmp.dir to a directory that's known to have a lot of space.
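A minimal sketch for core-site.xml, assuming /home/hadoop/tmp exists on a roomy partition and is writable by the Hadoop daemons:

   <!-- core-site.xml: fixed temporary directory, independent of any user name -->
   <property>
     <name>hadoop.tmp.dir</name>
     <value>/home/hadoop/tmp</value>
   </property>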
By default, HDFS doesn't require DataNodes to have any reserved free space. In practice, most systems become unstable when the amount of free space gets too low. You should set dfs.datanode.du.reserved to reserve 1 GB of free space on each DataNode. A DataNode will stop accepting block writes when its amount of free space falls below the reserved amount.
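The value is given in bytes, so reserving 1 GB looks like the following sketch for hdfs-site.xml:

   <!-- hdfs-site.xml: keep 1 GB (1,073,741,824 bytes) free for non-HDFS use -->
   <property>
     <name>dfs.datanode.du.reserved</name>
     <value>1073741824</value>
   </property>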
Each TaskTracker is allowed to run a configurable maximum number of map and reduce tasks. Hadoop's default is four tasks (two map tasks and two reduce tasks). The right number depends on many factors, although most setups call for one to two tasks per core. You can set a quad core machine to have a maximum of six map and reduce tasks (three each); counting one slot each for the TaskTracker and DataNode daemons, that makes a total of eight, or two per core. Similarly, you can set up a dual quad core machine to have a maximum of fourteen map and reduce tasks (seven each). This assumes that most MapReduce jobs are I/O bound. You should reduce the maximum number of tasks allowed if you expect more CPU-intensive loads.
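A sketch for mapred-site.xml on a quad core machine, following the arithmetic above (three slots each):

   <!-- mapred-site.xml: at most three concurrent map tasks per TaskTracker -->
   <property>
     <name>mapred.tasktracker.map.tasks.maximum</name>
     <value>3</value>
   </property>

   <!-- and at most three concurrent reduce tasks per TaskTracker -->
   <property>
     <name>mapred.tasktracker.reduce.tasks.maximum</name>
     <value>3</value>
   </property>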
In considering the number of tasks allowed, you should also consider the amount of heap memory allocated to each task. Hadoop's default of 200 MB per task is quite low. Many setups bump up the default to 512 MB, some even to 1 GB. This is not a final property; each job can request more (or less) heap space per task. Be sure that you have sufficient usable memory in your machines for your configuration parameters. Keep in mind that the DataNode and TaskTracker daemons each already use 1 GB of RAM.
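The per-task heap is controlled through the child JVM options. A sketch for mapred-site.xml raising it to 512 MB follows; because the property isn't marked final, individual jobs can still override it:

   <!-- mapred-site.xml: 512 MB heap for each spawned map or reduce task JVM -->
   <property>
     <name>mapred.child.java.opts</name>
     <value>-Xmx512m</value>
   </property>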
Although you can set the number of reduce tasks for each individual MapReduce job, it's desirable to have a default that works well most of the time. Hadoop's
² There's been some discussion in the Hadoop forums about whether one should configure multiple hard drives in a DataNode as RAID or JBOD. Hadoop doesn't need RAID's data redundancy because HDFS already replicates data across machines. Furthermore, Yahoo has stated that they were able to get a noticeable performance improvement using JBOD. The stated reason is that hard drives, even of the same model, have high variance in their speed. A RAID configuration would slow all I/O down to the speed of the slowest drive. On the other hand, letting each drive function independently allows each one to operate at its top speed, making the overall throughput of the system higher.