A TaskTracker is a node in the cluster that accepts tasks from a JobTracker. By design, every
TaskTracker is configured with a set of slots, which indicate the total number of tasks it can
accept at any given point in time. The key features of how the TaskTracker works are:
The TaskTracker creates and manages separate JVM processes to execute the actual work assigned
by the JobTracker. Because each piece of work runs in its own JVM process, the success or failure
of any particular task remains isolated and does not affect the TaskTracker as a whole.
The TaskTracker monitors all the processes it creates for job execution and captures their output
and exit codes. When a process finishes execution, the JobTracker is notified of its status.
The TaskTracker periodically sends signals called heartbeats to the JobTracker to indicate that
it is still alive. These messages also report the number of available slots, keeping the JobTracker
up to date on which nodes in the cluster have capacity for delegated work (a simplified sketch of
such a heartbeat follows below).
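To make the heartbeat exchange concrete, the following plain-Java sketch models the kind of information a heartbeat carries. The class and method names here (HeartbeatMessage, freeMapSlots) are hypothetical and do not correspond to Hadoop's internal classes; the sketch only illustrates the slot accounting described above.

// Hypothetical sketch of a TaskTracker heartbeat payload; the real Hadoop
// internals differ, but the information carried is essentially the same.
public class HeartbeatMessage {
    private final String trackerName;      // which TaskTracker sent the heartbeat
    private final int maxMapSlots;         // configured map slot capacity
    private final int maxReduceSlots;      // configured reduce slot capacity
    private final int occupiedMapSlots;    // slots currently running map tasks
    private final int occupiedReduceSlots; // slots currently running reduce tasks

    public HeartbeatMessage(String trackerName, int maxMapSlots, int maxReduceSlots,
                            int occupiedMapSlots, int occupiedReduceSlots) {
        this.trackerName = trackerName;
        this.maxMapSlots = maxMapSlots;
        this.maxReduceSlots = maxReduceSlots;
        this.occupiedMapSlots = occupiedMapSlots;
        this.occupiedReduceSlots = occupiedReduceSlots;
    }

    // The JobTracker uses the free slot counts to decide where new tasks can go.
    public int freeMapSlots()    { return maxMapSlots - occupiedMapSlots; }
    public int freeReduceSlots() { return maxReduceSlots - occupiedReduceSlots; }

    public String getTrackerName() { return trackerName; }
}

The design point is that a single periodic message doubles as both a liveness signal and a scheduling input: if heartbeats stop arriving, the node is presumed dead; while they keep arriving, their free-slot counts drive task assignment.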
When a task fails on a TaskTracker and the JobTracker is notified, the JobTracker can choose one
of three responses (a configuration sketch follows the list):
1. It can resubmit the job elsewhere in the cluster.
2. It can mark the specific record that caused the failure as one to skip, and avoid processing that portion of the data.
3. It can blacklist the TaskTracker as unreliable and move on.
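In the classic (Hadoop 1.x) MapReduce API, each of these responses corresponds to job-level settings a developer can tune. The sketch below is illustrative only; the limits shown are arbitrary example values, and defaults and behavior vary by Hadoop version.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;

public class FailureHandlingConfig {
    public static JobConf configure(JobConf conf) {
        // 1. Resubmission: allow each failed task to be retried elsewhere
        //    in the cluster before the whole job is declared failed.
        conf.setMaxMapAttempts(4);
        conf.setMaxReduceAttempts(4);

        // 2. Skipping bad records: after repeated failures, start skipping
        //    the offending records instead of reprocessing them.
        SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
        SkipBadRecords.setMapperMaxSkipRecords(conf, 1);

        // 3. Blacklisting: after this many task failures on one TaskTracker,
        //    that tracker is treated as unreliable for this job.
        conf.setMaxTaskFailuresPerTracker(4);

        return conf;
    }
}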
The combination of these two services, the JobTracker and the TaskTracker, and their management
of job execution is how Hadoop executes MapReduce processes. The JobTracker is a single point of
failure for MapReduce processing on Hadoop; if it goes down, all executing and queued jobs are halted.
HDFS is a file system that also provides load balancing, disk management, block allocation, and
advanced file management within its design. For further details on these areas, refer to the HDFS
architecture guide on Apache's HDFS Project page (http://hadoop.apache.org/).
Based on the brief architecture discussion of HDFS in this section, we can see how Hadoop handles
all of its data management functions, implemented through a series of API calls. Because of its
file-based architecture, HDFS achieves near-unlimited scalability and can deliver sustained
performance as the cluster's infrastructure is expanded.
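As a small illustration of those API calls, the hedged sketch below writes and reads a file through the org.apache.hadoop.fs.FileSystem interface. The path is a placeholder, and the snippet assumes a reachable cluster configuration (it falls back to the local file system if none is configured).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // HDFS if configured, local FS otherwise

        Path file = new Path("/tmp/hdfs-example.txt"); // placeholder path

        // Write: block allocation and replication are handled by HDFS itself.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello HDFS");
        }

        // Read the data back through the same API.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        fs.delete(file, false);                        // clean up (non-recursive)
    }
}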
MapReduce
MapReduce is a programming model for processing extremely large data sets, originally developed
by Google in the early 2000s to solve the scalability of search computation. Its foundations are
based on principles of parallel and distributed processing without any database dependency. The
flexibility of MapReduce lies in its ability to run distributed computations over large amounts
of data on clusters of commodity servers, with a simple task-based model for managing that work.
The key features of MapReduce that make it the processing interface for platforms such as Hadoop and Cassandra include the following (a minimal word-count example follows the list):
Automatic parallelization
Automatic distribution
Fault-tolerance
Status and monitoring tools
Easy abstraction for programmers
Programming language flexibility
Extensibility
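These properties are easiest to appreciate in code. Below is a minimal word-count job, the customary introductory example for the model, written against the org.apache.hadoop.mapreduce API; the input and output paths are supplied as command-line arguments, and cluster-specific settings are omitted.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Notice that nothing in the code deals with parallelization, distribution, or failure handling; the framework supplies those automatically, which is precisely the appeal listed above.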