A TaskTracker is a node in the cluster that accepts tasks from a JobTracker. By design, every
TaskTracker is configured with a set of slots, which indicate the total number of tasks it can
accept at any given point in time. The key features of how the TaskTracker works are:
The TaskTracker creates and manages separate JVM processes to execute the actual work assigned
by the JobTracker. Because each piece of work runs in its own JVM process, the success or failure
of any particular task remains isolated and does not affect the TaskTracker as a whole.
The TaskTracker monitors all the processes it creates for job execution and captures their output
and exit codes. When a process finishes execution, the JobTracker is notified of its status.
The TaskTracker periodically sends signals called heartbeats to the JobTracker to indicate that
it is still alive. These messages also report the number of available slots, keeping the JobTracker
up to date on which nodes in the cluster have capacity for delegated work (a simplified sketch of
such a heartbeat follows below).
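To make the heartbeat exchange concrete, the following plain-Java sketch models the kind of information a heartbeat carries. The class and method names here (HeartbeatMessage, freeMapSlots) are hypothetical and do not correspond to Hadoop's internal classes; the sketch only illustrates the slot accounting described above.

// Hypothetical sketch of a TaskTracker heartbeat payload; the real Hadoop
// internals differ, but the information carried is essentially the same.
public class HeartbeatMessage {
    private final String trackerName;      // which TaskTracker sent the heartbeat
    private final int maxMapSlots;         // configured map slot capacity
    private final int maxReduceSlots;      // configured reduce slot capacity
    private final int occupiedMapSlots;    // slots currently running map tasks
    private final int occupiedReduceSlots; // slots currently running reduce tasks

    public HeartbeatMessage(String trackerName, int maxMapSlots, int maxReduceSlots,
                            int occupiedMapSlots, int occupiedReduceSlots) {
        this.trackerName = trackerName;
        this.maxMapSlots = maxMapSlots;
        this.maxReduceSlots = maxReduceSlots;
        this.occupiedMapSlots = occupiedMapSlots;
        this.occupiedReduceSlots = occupiedReduceSlots;
    }

    // The JobTracker uses the free slot counts to decide where new tasks can go.
    public int freeMapSlots()    { return maxMapSlots - occupiedMapSlots; }
    public int freeReduceSlots() { return maxReduceSlots - occupiedReduceSlots; }

    public String getTrackerName() { return trackerName; }
}

The design point is that a single periodic message doubles as both a liveness signal and a scheduling input: if heartbeats stop arriving, the node is presumed dead; while they keep arriving, their free-slot counts drive task assignment.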
When a task fails on a TaskTracker and the JobTracker is notified, the JobTracker can choose one
of three responses (a configuration sketch follows the list):
1. It can resubmit the job elsewhere in the cluster.
2. It can mark the specific record that caused the failure as one to skip, and avoid processing that portion of the data.
3. It can blacklist the TaskTracker as unreliable and move on.
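In the classic (Hadoop 1.x) MapReduce API, each of these responses corresponds to job-level settings a developer can tune. The sketch below is illustrative only; the limits shown are arbitrary example values, and defaults and behavior vary by Hadoop version.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;

public class FailureHandlingConfig {
    public static JobConf configure(JobConf conf) {
        // 1. Resubmission: allow each failed task to be retried elsewhere
        //    in the cluster before the whole job is declared failed.
        conf.setMaxMapAttempts(4);
        conf.setMaxReduceAttempts(4);

        // 2. Skipping bad records: after repeated failures, start skipping
        //    the offending records instead of reprocessing them.
        SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
        SkipBadRecords.setMapperMaxSkipRecords(conf, 1);

        // 3. Blacklisting: after this many task failures on one TaskTracker,
        //    that tracker is treated as unreliable for this job.
        conf.setMaxTaskFailuresPerTracker(4);

        return conf;
    }
}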
The combination of these two services, the JobTracker and the TaskTracker, and their management
of job execution is how Hadoop executes MapReduce processes. The JobTracker is a single point of
failure for MapReduce processing on Hadoop; if it goes down, all executing and queued jobs are halted.
HDFS is a file system that also provides load balancing, disk management, block allocation, and
advanced file management within its design. For further details on these areas, refer to the HDFS
architecture guide on Apache's HDFS Project page (http://hadoop.apache.org/).
Based on the brief architecture discussion of HDFS in this section, we can see how Hadoop handles
all of its data management functions, implemented through a series of API calls. Because of its
file-based architecture, HDFS achieves near-unlimited scalability and can deliver sustained
performance as the cluster's infrastructure is expanded.
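As a small illustration of those API calls, the hedged sketch below writes and reads a file through the org.apache.hadoop.fs.FileSystem interface. The path is a placeholder, and the snippet assumes a reachable cluster configuration (it falls back to the local file system if none is configured).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // HDFS if configured, local FS otherwise

        Path file = new Path("/tmp/hdfs-example.txt"); // placeholder path

        // Write: block allocation and replication are handled by HDFS itself.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello HDFS");
        }

        // Read the data back through the same API.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        fs.delete(file, false);                        // clean up (non-recursive)
    }
}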
MapReduce
MapReduce is a programming model for processing extremely large data sets, originally developed
by Google in the early 2000s to solve the scalability of search computation. Its foundations are
based on principles of parallel and distributed processing without any database dependency. The
flexibility of MapReduce lies in its ability to run distributed computations over large amounts
of data on clusters of commodity servers, with a simple task-based model for managing that work.
The key features of MapReduce that make it the processing interface for platforms such as Hadoop and Cassandra include the following (a minimal word-count example follows the list):
Automatic parallelization
Automatic distribution
Fault-tolerance
Status and monitoring tools
Easy abstraction for programmers
Programming language flexibility
Extensibility
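These properties are easiest to appreciate in code. Below is a minimal word-count job, the customary introductory example for the model, written against the org.apache.hadoop.mapreduce API; the input and output paths are supplied as command-line arguments, and cluster-specific settings are omitted.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Notice that nothing in the code deals with parallelization, distribution, or failure handling; the framework supplies those automatically, which is precisely the appeal listed above.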