Operational Big Data Management - Microsoft Big Data Solutions


service hmonitor-namenode-monitor status

An important concept to understand is that Hadoop has significant high-availability and robustness features built into it to withstand unscheduled downtime. Hadoop is built with the understanding that hardware does fail, and one of the architectural goals of the Hadoop Distributed File System (HDFS) is automatic recovery from failures. With hundreds of servers in your big data solution, some hardware will always be nonfunctional, and the architecture of HDFS is intended to handle these failures gracefully while repairs are made. The three areas of concern are DataNode failures, NameNode failures, and network partitions.

A DataNode may become unresponsive for any number of reasons. It might suffer a hardware failure, such as a failed motherboard or hard drive, one of its replicas might become corrupted, or it might fail for any of the many other reasons that equipment fails. Each DataNode sends a heartbeat to the NameNode at a regular interval. If the NameNode stops receiving heartbeats from a DataNode for longer than a configured timeout, it marks that DataNode as dead and stops sending new data requests to it. Any data that was on that DataNode is no longer available to HDFS, so the replication factor of the affected blocks will likely fall below the specified value. This triggers re-replication of those blocks to other DataNodes.
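The heartbeat and re-replication flow can be made concrete with a small amount of bookkeeping. The Python sketch below is a toy model under stated assumptions. It is not HDFS source code, and the names `ClusterView`, `heartbeat_expiry_ms`, and `plan_rereplication` are hypothetical.

The timeout formula is our understanding of how Apache Hadoop 2.x computes it: twice `dfs.namenode.heartbeat.recheck-interval` plus ten times `dfs.heartbeat.interval`. The defaults used here are 300,000 ms and 3 s. Your distribution may use different values. Real HDFS also weighs rack placement when choosing target DataNodes, which this sketch ignores.

```python
# Toy model of NameNode-style DataNode liveness tracking and re-replication
# planning. Hypothetical names throughout; this is NOT HDFS source code.
from dataclasses import dataclass, field

# Assumed Apache Hadoop 2.x defaults (hdfs-default.xml); verify for your release.
HEARTBEAT_INTERVAL_S = 3        # dfs.heartbeat.interval (seconds)
RECHECK_INTERVAL_MS = 300_000   # dfs.namenode.heartbeat.recheck-interval (ms)


def heartbeat_expiry_ms(heartbeat_s: int = HEARTBEAT_INTERVAL_S,
                        recheck_ms: int = RECHECK_INTERVAL_MS) -> int:
    """Time without a heartbeat after which a DataNode is treated as dead."""
    return 2 * recheck_ms + 10 * 1000 * heartbeat_s


@dataclass
class ClusterView:
    replication: int = 3                                   # target replicas per block
    last_heartbeat_ms: dict = field(default_factory=dict)  # node -> last heartbeat time
    block_locations: dict = field(default_factory=dict)    # block id -> set of nodes

    def heartbeat(self, node: str, now_ms: int) -> None:
        """Record a heartbeat received from a DataNode."""
        self.last_heartbeat_ms[node] = now_ms

    def dead_nodes(self, now_ms: int) -> set:
        """Nodes whose last heartbeat is older than the expiry interval."""
        expiry = heartbeat_expiry_ms()
        return {n for n, t in self.last_heartbeat_ms.items() if now_ms - t > expiry}

    def plan_rereplication(self, now_ms: int) -> dict:
        """Map each under-replicated block to live nodes that could receive a copy."""
        dead = self.dead_nodes(now_ms)
        live = sorted(set(self.last_heartbeat_ms) - dead)
        plan = {}
        for block, nodes in self.block_locations.items():
            surviving = nodes - dead
            missing = self.replication - len(surviving)
            # A block with no surviving replica cannot be re-replicated (data loss).
            if missing > 0 and surviving:
                candidates = [n for n in live if n not in surviving]
                plan[block] = candidates[:missing]
        return plan


# Usage: dn3 stops heartbeating, so blk_1 drops to two replicas.
view = ClusterView(replication=3)
for node in ["dn1", "dn2", "dn3", "dn4"]:
    view.heartbeat(node, now_ms=0)
view.block_locations = {"blk_1": {"dn1", "dn2", "dn3"}}
now = heartbeat_expiry_ms() + 1
for node in ["dn1", "dn2", "dn4"]:
    view.heartbeat(node, now_ms=now)
print(view.dead_nodes(now))          # {'dn3'}
print(view.plan_rereplication(now))  # {'blk_1': ['dn4']}
```

With the assumed defaults, the expiry works out to 2 × 300,000 ms + 10 × 3,000 ms = 630,000 ms, or 10.5 minutes. This is why a failed DataNode is not declared dead the instant it misses a single heartbeat.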

Hadoop clusters can also become unresponsive because of NameNode failures. NameNodes are most likely to fail because of misconfiguration and network issues. This mirrors our collective experience with Windows cluster failovers. Failovers there are usually caused not by hardware problems but by external issues such as Active Directory problems, networking, or other configuration issues. Pay heed to these external influences as you diagnose failures within your environment.
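Misconfiguration and network problems are the usual culprits in NameNode failures. A quick first diagnostic step is therefore to confirm two things: that the NameNode's host name resolves, and that its RPC port accepts TCP connections.

The Python sketch below uses only the standard `socket` module. The function name `check_namenode_endpoint` is hypothetical. The default port 8020 is an assumption; check `fs.defaultFS` (or `fs.default.name` in older releases) in your `core-site.xml`. A passing check rules out only basic name-resolution and connectivity faults, not Active Directory or other configuration issues.

```python
# Hypothetical diagnostic helper: checks DNS resolution and TCP reachability
# of a NameNode RPC endpoint. Port 8020 is an ASSUMED default; confirm it
# against fs.defaultFS in your cluster's core-site.xml.
import socket


def check_namenode_endpoint(host: str, port: int = 8020, timeout_s: float = 3.0) -> dict:
    """Return a dict describing name resolution and TCP reachability."""
    result = {"host": host, "port": port, "resolved": [], "reachable": False, "error": None}

    # Step 1: can the host name be resolved at all?
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        result["resolved"] = sorted({info[4][0] for info in infos})
    except socket.gaierror as exc:
        result["error"] = f"DNS resolution failed: {exc}"
        return result

    # Step 2: does the port accept a TCP connection within the timeout?
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            result["reachable"] = True
    except OSError as exc:  # refused, timed out, unreachable network, etc.
        result["error"] = f"TCP connect failed: {exc}"
    return result


# Example (hypothetical host name):
# print(check_namenode_endpoint("namenode.example.com"))
```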

In this section, we examined some basics of planning for high availability. In the next section, we dive into what happens when something does go wrong: preparing for and dealing with disaster recovery.

Disaster Recovery

By now, you should be aware that Hadoop has many high-availability features built into it to prevent failures. As you probably know by this point in your career, despite all the features included in products and all the planning we can do, disasters happen. When they do, your service level
