On all nodes, you change the value of mapred.job.tracker in the file $HADOOP_PREFIX/conf/mapred-site.xml to be:
hc1nn:54311
This defines, on every server, the host and port of the Map Reduce Job Tracker, pointing it at the Name Node machine.
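In mapred-site.xml, that setting takes the standard Hadoop property form, along these lines (a sketch; the property name and value are as given above):

<property>
  <name>mapred.job.tracker</name>
  <value>hc1nn:54311</value>
</property>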
On all nodes, check that the value of dfs.replication in the file $HADOOP_PREFIX/conf/hdfs-site.xml is set to 3.
This means that three copies of each block of data will automatically be kept by HDFS.
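In hdfs-site.xml, the entry would look something like this (a sketch in the standard Hadoop property format):

<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>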
In the same file, ensure that the value http://localhost:50070/ for the property dfs.http.address is changed to:
http://hc1nn:50070/
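As a sketch, the corresponding hdfs-site.xml entry would be along these lines; note that the text gives the value as a URL, while Hadoop 1.x configuration normally expects the bare host:port form:

<property>
  <name>dfs.http.address</name>
  <!-- host:port form; the text shows this address as http://hc1nn:50070/ -->
  <value>hc1nn:50070</value>
</property>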
This sets the HDFS web/http address to point to the Name Node master machine hc1nn. With none of the
Hadoop servers running, you format the cluster from the Name Node server—in this instance, hc1nn:
hadoop namenode -format
At this point, a common problem can occur with Hadoop file system versioning between the Name Node and the Data Nodes. Within HDFS, files named VERSION contain version numbering information that is regenerated each time the file system is formatted; for example:
[hadoop@hc1nn dfs]$ pwd
/app/hadoop/tmp/dfs
[hadoop@hc1nn dfs]$ find . -type f -name VERSION -exec grep -H namespaceID {} \;
./data/current/VERSION:namespaceID=1244166645
./name/current/VERSION:namespaceID=1244166645
./name/previous.checkpoint/VERSION:namespaceID=1244166645
./namesecondary/current/VERSION:namespaceID=1244166645
The Linux command shown here, executed as the hadoop user, searches for the VERSION files under /app/hadoop/tmp/dfs and strips the namespace ID information out of them. If this command were executed on both the Name Node server and the Data Node servers, you would expect to see the same value, 1244166645, everywhere. When this versioning gets out of step on the data nodes, an error such as the following occurs:
ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs
While this problem seems to have two solutions, only one is viable. Although you could delete the data directory
/app/hadoop/tmp/dfs/data on the offending data node, reformat the file system, and then start the servers, this
approach will cause data loss. The second, more effective method involves editing the VERSION files on the data
nodes so that the namespace ID values match those found on the Name Node machine.
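For example, with the Hadoop servers stopped, a single edit on each affected data node can bring the ID into line. This is a sketch only: the path and the ID value 1244166645 are taken from the listing above, and the ID must match the one in your own Name Node's VERSION file:

# Run as the hadoop user on the offending data node, with the servers stopped.
# Replace 1244166645 with the namespaceID from the Name Node's VERSION file.
sed -i 's/^namespaceID=.*/namespaceID=1244166645/' /app/hadoop/tmp/dfs/data/current/VERSION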
You need to ensure that your firewall allows access on the ports that Hadoop uses to communicate. When you attempt to start the Hadoop servers, check the logs in the log directory (/usr/local/hadoop/logs).
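For example, on a CentOS-style system using iptables, rules along these lines would open the two ports mentioned in this section. This is a sketch only; any other ports your Hadoop services listen on must be opened in the same way:

# Allow the Job Tracker port and the HDFS web port through the firewall.
iptables -I INPUT -p tcp --dport 54311 -j ACCEPT
iptables -I INPUT -p tcp --dport 50070 -j ACCEPT
service iptables save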
Now, start the cluster from the Name Node; this time, you will start the HDFS servers using the script start-dfs.sh:
[hadoop@hc1nn logs]$ start-dfs.sh
starting namenode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hadoop-namenode-hc1nn.out
hc1r1m2: starting datanode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hadoop-datanode-hc1r1m2.out