Configuration Management
Hadoop does not have a single, global location for configuration information. Instead, each Hadoop node in the cluster has its own set of configuration files, and it is up to administrators to ensure that they are kept in sync across the system. There are parallel shell tools that can help do this, such as dsh or pdsh. This is an area where Hadoop cluster management tools like Cloudera Manager and Apache Ambari really shine, since they take care of propagating changes across the cluster.
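For example, the pdsh suite includes a parallel copy command, pdcp, which can push a configuration directory to every node in one step. The following is a minimal sketch, not a recommended layout: the hosts file and directory paths are illustrative.

    # Push the local Hadoop configuration directory to every host listed in
    # a file of worker hostnames (one per line). The -w ^file option reads
    # the target hosts from that file; -r copies the directory recursively.
    # All paths here are hypothetical.
    pdcp -w ^/etc/hadoop/workers -r /etc/hadoop/conf /etc/hadoop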
Hadoop is designed so that it is possible to have a single set of configuration files that are
used for all master and worker machines. The great advantage of this is simplicity, both
conceptually (since there is only one configuration to deal with) and operationally (as the
Hadoop scripts are sufficient to manage a single configuration setup).
For some clusters, the one-size-fits-all configuration model breaks down. For example, if
you expand the cluster with new machines that have a different hardware specification
from the existing ones, you need a different configuration for the new machines to take
advantage of their extra resources.
In these cases, you need to have the concept of a class of machine and maintain a separate
configuration for each class. Hadoop doesn't provide tools to do this, but there are several
excellent tools for doing precisely this type of configuration management, such as Chef,
Puppet, CFEngine, and Bcfg2.
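Short of adopting one of these tools, the idea of a machine class can be sketched with a per-class configuration directory selected at startup through HADOOP_CONF_DIR, the standard variable Hadoop consults to locate its configuration. In this sketch, /etc/hadoop-class is an assumed marker file written when the machine is provisioned, not part of Hadoop:

    # Hypothetical: pick a configuration directory based on machine class.
    # /etc/hadoop-class is an assumed provisioning artifact; if it is
    # missing, fall back to the standard configuration.
    CLASS=$(cat /etc/hadoop-class 2>/dev/null || echo standard)
    export HADOOP_CONF_DIR=/etc/hadoop/conf.$CLASS

A tool like Puppet or Chef would then be responsible for keeping the contents of each per-class directory, and the marker file itself, up to date.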
For a cluster of any size, it can be a challenge to keep all of the machines in sync. Consider what happens if a machine is unavailable when you push out an update: who ensures it gets the update when it becomes available again? This is a big problem and can lead to divergent installations, so even if you use the Hadoop control scripts for managing Hadoop, it may be a good idea to use configuration management tools for maintaining the cluster. These tools are also excellent for doing regular maintenance, such as patching security holes and updating system packages.
Environment Settings
In this section, we consider how to set the variables in hadoop-env.sh. There are also analogous configuration files for MapReduce and YARN (but not for HDFS), called mapred-env.sh and yarn-env.sh, where variables pertaining to those components can be set. Note that the MapReduce and YARN files override the values set in hadoop-env.sh.
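For instance, a logging directory set in hadoop-env.sh applies to all daemons unless a YARN-specific value in yarn-env.sh takes precedence. The fragment below is a minimal sketch; the paths are illustrative, while JAVA_HOME, HADOOP_LOG_DIR, and YARN_LOG_DIR are standard environment variables:

    # hadoop-env.sh: settings shared by all Hadoop daemons (paths illustrative)
    export JAVA_HOME=/usr/lib/jvm/java       # JVM used by the daemons
    export HADOOP_LOG_DIR=/var/log/hadoop    # default log location

    # yarn-env.sh: YARN-specific settings override those in hadoop-env.sh
    export YARN_LOG_DIR=/var/log/hadoop-yarn # YARN daemons log here instead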