The DataNode will use them all in parallel to speed up I/O.² You should also specify directories on multiple drives for mapred.local.dir to speed up the processing of temporary data.
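As a sketch of the relevant configuration files (the /disk1 and /disk2 mount points here are placeholders for your own drives), multiple directories are given as a comma-separated list:

   <!-- hdfs-site.xml: one HDFS data directory per physical drive -->
   <property>
     <name>dfs.data.dir</name>
     <value>/disk1/hdfs/data,/disk2/hdfs/data</value>
   </property>

   <!-- mapred-site.xml: spread temporary MapReduce data across the same drives -->
   <property>
     <name>mapred.local.dir</name>
     <value>/disk1/mapred/local,/disk2/mapred/local</value>
   </property>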
The default value of Hadoop's temporary directory, hadoop.tmp.dir, depends on the user name. You should avoid having any Hadoop property depend on a user name, as there can be mismatches between the user name used to submit a job and the user name used to start a Hadoop node. You should set it to something like /home/hadoop/tmp to be independent of any user name. Another problem with the default value of hadoop.tmp.dir is that it points to the /tmp directory. Although that's an appropriate place for temporary storage, most default Linux configurations have a quota on /tmp that is too small for Hadoop. Rather than increase the quota for /tmp, it's better to point hadoop.tmp.dir to a directory that's known to have a lot of space.
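A minimal sketch for core-site.xml, assuming /home/hadoop/tmp exists on a roomy partition and is writable by the Hadoop daemons:

   <!-- core-site.xml: fixed temporary directory, independent of any user name -->
   <property>
     <name>hadoop.tmp.dir</name>
     <value>/home/hadoop/tmp</value>
   </property>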
By default, HDFS doesn't require DataNodes to have any reserved free space. In practice, most systems become unstable when the amount of free space gets too low. You should set dfs.datanode.du.reserved to reserve 1 GB of free space on each DataNode. A DataNode will stop accepting block writes when its amount of free space falls below the reserved amount.
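The value is given in bytes, so reserving 1 GB looks like the following sketch for hdfs-site.xml:

   <!-- hdfs-site.xml: keep 1 GB (1,073,741,824 bytes) free for non-HDFS use -->
   <property>
     <name>dfs.datanode.du.reserved</name>
     <value>1073741824</value>
   </property>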
Each TaskTracker is allowed to run a configurable maximum number of map and reduce tasks. Hadoop's default is four tasks (two map tasks and two reduce tasks). The right number depends on many factors, although most setups call for one to two tasks per core. You can set a quad core machine to have a maximum of six map and reduce tasks (three each); counting one slot each for the TaskTracker and DataNode daemons, that makes a total of eight, or two per core. Similarly, you can set up a dual quad core machine to have a maximum of fourteen map and reduce tasks (seven each). This assumes that most MapReduce jobs are I/O bound. You should reduce the maximum number of tasks allowed if you expect more CPU-intensive loads.
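A sketch for mapred-site.xml on a quad core machine, following the arithmetic above (three slots each):

   <!-- mapred-site.xml: at most three concurrent map tasks per TaskTracker -->
   <property>
     <name>mapred.tasktracker.map.tasks.maximum</name>
     <value>3</value>
   </property>

   <!-- and at most three concurrent reduce tasks per TaskTracker -->
   <property>
     <name>mapred.tasktracker.reduce.tasks.maximum</name>
     <value>3</value>
   </property>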
In considering the number of tasks allowed, you should also consider the amount of heap memory allocated to each task. Hadoop's default of 200 MB per task is quite low. Many setups bump up the default to 512 MB, some even to 1 GB. This is not a final property; each job can request more (or less) heap space per task. Be sure that you have sufficient usable memory in your machines for your configuration parameters. Keep in mind that the DataNode and TaskTracker daemons each already use 1 GB of RAM.
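The per-task heap is controlled through the child JVM options. A sketch for mapred-site.xml raising it to 512 MB follows; because the property isn't marked final, individual jobs can still override it:

   <!-- mapred-site.xml: 512 MB heap for each spawned map or reduce task JVM -->
   <property>
     <name>mapred.child.java.opts</name>
     <value>-Xmx512m</value>
   </property>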
Although you can set the number of reduce tasks for each individual MapReduce job, it's desirable to have a default that works well most of the time. Hadoop's
² There's been some discussion in the Hadoop forums about whether one should configure multiple hard drives in a DataNode as RAID or JBOD. Hadoop doesn't need RAID's data redundancy because HDFS already replicates data across machines. Furthermore, Yahoo has stated that they were able to get a noticeable performance improvement using JBOD. The stated reason is that hard drives, even of the same model, have high variance in their speed. A RAID configuration would slow all I/O down to the speed of the slowest drive. On the other hand, letting each drive function independently allows each one to operate at its top speed, making the overall throughput of the system higher.