Integration with Hadoop - Mastering Apache Cassandra

TaskTracker

Like the DataNode in the case of HDFS, the TaskTracker is the actual execution unit of Hadoop. It launches child JVMs to run Mapper and Reducer tasks. The maximum number of Mapper tasks and the maximum number of Reducer tasks that a TaskTracker runs concurrently can be set independently. The TaskTracker may reuse child JVMs to improve efficiency.

Reliability of data and processes in Hadoop

Hadoop is a robust and reliable architecture. It is meant to run on commodity hardware and hence handles failures automatically. It detects the failure of a task and retries the failed task. It is also fault tolerant at the storage level: the data on a DataNode is replicated on other nodes, so the system heals by itself if a DataNode becomes unavailable.
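
The retry behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Hadoop code: `run_with_retries` and `make_flaky_task` are hypothetical names, and the limit of four attempts is an assumption that mirrors the commonly cited Hadoop default (check the configuration of your own version).

```python
def run_with_retries(task, max_attempts=4):
    """Call task(attempt) until it succeeds or max_attempts is exhausted.

    Returns a (result, attempts_used) tuple. Raises RuntimeError when
    every attempt fails. max_attempts=4 is an assumed default.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt), attempt
        except Exception as exc:  # a failed attempt is detected here
            last_error = exc
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


def make_flaky_task(failures_before_success):
    """Hypothetical task that fails a fixed number of times, then succeeds."""
    def task(attempt):
        if attempt <= failures_before_success:
            raise IOError(f"simulated failure on attempt {attempt}")
        return "done"
    return task


if __name__ == "__main__":
    result, attempts = run_with_retries(make_flaky_task(2))
    print(result, attempts)
```

A task that fails twice succeeds on its third attempt, while a task that never succeeds is reported as failed once the attempt limit is reached.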

Hadoop allows servers to join or leave the cluster without disrupting it. Rack-aware storage of data protects the cluster against disk failures, rack or machine power failures, and even a complete rack going down.
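
To see why rack-aware storage survives the loss of a whole rack, consider the following minimal sketch. It assumes the commonly described HDFS placement for a replication factor of three: the first replica on the writer's node, the second on a node in a different rack, and the third on another node in that same remote rack. The function names and the topology dictionary are hypothetical, and the sketch picks nodes deterministically (first in sorted order) where a real cluster would choose among eligible nodes.

```python
def rack_of(node, topology):
    """Return the rack that contains the given node."""
    for rack, nodes in topology.items():
        if node in nodes:
            return rack
    raise ValueError(f"unknown node: {node}")


def place_replicas(writer, topology):
    """Choose three nodes for a block written from `writer`.

    topology maps a rack name to the list of node names in that rack.
    Assumed policy: replica 1 on the writer, replica 2 on a different
    rack, replica 3 on another node of the same rack as replica 2.
    """
    local_rack = rack_of(writer, topology)
    remote_racks = sorted(r for r in topology if r != local_rack and topology[r])
    if not remote_racks:
        raise ValueError("rack-aware placement needs at least two racks")
    remote_nodes = sorted(topology[remote_racks[0]])
    if len(remote_nodes) < 2:
        raise ValueError("the remote rack needs at least two nodes")
    return [writer, remote_nodes[0], remote_nodes[1]]


def survives_rack_failure(replicas, topology, failed_rack):
    """True if at least one replica lives outside the failed rack."""
    return any(rack_of(node, topology) != failed_rack for node in replicas)


if __name__ == "__main__":
    topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
    replicas = place_replicas("n1", topology)
    print(replicas)
    print(all(survives_rack_failure(replicas, topology, r) for r in topology))
```

Because the replicas always span two racks, losing either rack entirely still leaves at least one readable copy of the block.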

The following figure shows the well-known schema of a reliable Hadoop infrastructure, using commodity hardware for the slaves and heavy-duty servers (top of the rack) for the masters. Please note that these are physical servers, as they are laid out in data centers. Later, when we discuss using Cassandra as a data store for Hadoop, we will use a ring representation. Even in that case, the physical configuration may be the same as the one represented in the following figure, but the logical configuration, as we have seen throughout this topic, will be a ring-like structure to emphasize the token distribution.
