Configuration Management
Hadoop does not have a single, global location for configuration information. Instead, each Hadoop node in the cluster has its own set of configuration files, and it is up to administrators to ensure that they are kept in sync across the system. There are parallel shell tools that can help do this, such as dsh or pdsh. This is an area where Hadoop cluster management tools like Cloudera Manager and Apache Ambari really shine, since they take care of propagating changes across the cluster.
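For example, the pdsh suite includes a parallel copy command, pdcp, which can push a configuration directory to every node in one step. The following is a minimal sketch, not a recommended layout: the hosts file and directory paths are illustrative.

    # Push the local Hadoop configuration directory to every host listed in
    # a file of worker hostnames (one per line). The -w ^file option reads
    # the target hosts from that file; -r copies the directory recursively.
    # All paths here are hypothetical.
    pdcp -w ^/etc/hadoop/workers -r /etc/hadoop/conf /etc/hadoop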
Hadoop is designed so that it is possible to have a single set of configuration files that are
used for all master and worker machines. The great advantage of this is simplicity, both
conceptually (since there is only one configuration to deal with) and operationally (as the
Hadoop scripts are sufficient to manage a single configuration setup).
For some clusters, the one-size-fits-all configuration model breaks down. For example, if
you expand the cluster with new machines that have a different hardware specification
from the existing ones, you need a different configuration for the new machines to take
advantage of their extra resources.
In these cases, you need to have the concept of a class of machine and maintain a separate
configuration for each class. Hadoop doesn't provide tools to do this, but there are several
excellent tools for doing precisely this type of configuration management, such as Chef,
Puppet, CFEngine, and Bcfg2.
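Short of adopting one of these tools, the idea of a machine class can be sketched with a per-class configuration directory selected at startup through HADOOP_CONF_DIR, the standard variable Hadoop consults to locate its configuration. In this sketch, /etc/hadoop-class is an assumed marker file written when the machine is provisioned, not part of Hadoop:

    # Hypothetical: pick a configuration directory based on machine class.
    # /etc/hadoop-class is an assumed provisioning artifact; if it is
    # missing, fall back to the standard configuration.
    CLASS=$(cat /etc/hadoop-class 2>/dev/null || echo standard)
    export HADOOP_CONF_DIR=/etc/hadoop/conf.$CLASS

A tool like Puppet or Chef would then be responsible for keeping the contents of each per-class directory, and the marker file itself, up to date.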
For a cluster of any size, it can be a challenge to keep all of the machines in sync. Consider what happens if a machine is unavailable when you push out an update: who ensures it gets the update when it becomes available again? This is a big problem and can lead to divergent installations, so even if you use the Hadoop control scripts for managing Hadoop, it may be a good idea to use configuration management tools for maintaining the cluster. These tools are also excellent for doing regular maintenance, such as patching security holes and updating system packages.
Environment Settings
In this section, we consider how to set the variables in hadoop-env.sh. There are also analogous configuration files for MapReduce and YARN (but not for HDFS), called mapred-env.sh and yarn-env.sh, where variables pertaining to those components can be set. Note that the MapReduce and YARN files override the values set in hadoop-env.sh.
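For instance, a logging directory set in hadoop-env.sh applies to all daemons unless a YARN-specific value in yarn-env.sh takes precedence. The fragment below is a minimal sketch; the paths are illustrative, while JAVA_HOME, HADOOP_LOG_DIR, and YARN_LOG_DIR are standard environment variables:

    # hadoop-env.sh: settings shared by all Hadoop daemons (paths illustrative)
    export JAVA_HOME=/usr/lib/jvm/java       # JVM used by the daemons
    export HADOOP_LOG_DIR=/var/log/hadoop    # default log location

    # yarn-env.sh: YARN-specific settings override those in hadoop-env.sh
    export YARN_LOG_DIR=/var/log/hadoop-yarn # YARN daemons log here instead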