Jobtracker
The master process responsible for scheduling MapReduce jobs. The jobtracker accepts new jobs, breaks
them into map and reduce tasks, assigns those tasks to tasktrackers in the cluster, and tracks the
tasks through to job completion. In smaller clusters, it often runs on the same server as the
namenode.
Tasktracker
The worker process responsible for running the map or reduce tasks assigned to it by the jobtracker.
In a typical Hadoop deployment, tasktrackers run on the same servers as datanodes.
Like Cassandra, Hadoop is a distributed system. The MapReduce jobtracker spreads tasks across
the cluster, preferably close to the data those tasks need. When the jobtracker initiates tasks, it
looks to HDFS for information about where that data is stored. Similarly, Cassandra's built-in
Hadoop integration provides the jobtracker with data locality information so that tasks can be
scheduled close to the data.
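To make this concrete, the following is a minimal job-driver sketch showing where that locality information comes from: the ColumnFamilyInputFormat shipped with Cassandra's Hadoop integration turns each node's token ranges into input splits and reports the owning node as each split's location, and ConfigHelper tells it which cluster and column family to read. The class and method names follow the Cassandra Hadoop API of this era and vary slightly between releases; the hostnames, port, keyspace, and column family names are placeholders.

    import java.nio.ByteBuffer;

    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.cassandra.thrift.SlicePredicate;
    import org.apache.cassandra.thrift.SliceRange;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class CassandraJobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Submit the job to the jobtracker running outside the
            // Cassandra cluster (placeholder host and port).
            conf.set("mapred.job.tracker", "jobtracker.example.com:8021");

            // Tell the input format how to reach Cassandra and what to read.
            // ColumnFamilyInputFormat builds one input split per token range
            // and reports the owning node as the split's location, which is
            // what gives the jobtracker its data locality information.
            ConfigHelper.setInputInitialAddress(conf, "cassandra1.example.com");
            ConfigHelper.setInputRpcPort(conf, "9160");
            ConfigHelper.setInputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
            ConfigHelper.setInputColumnFamily(conf, "MyKeyspace", "MyColumnFamily");

            // Read every column of each row (an empty slice range means
            // "from the first column to the last").
            SlicePredicate predicate = new SlicePredicate().setSlice_range(
                    new SliceRange(ByteBuffer.wrap(new byte[0]),
                                   ByteBuffer.wrap(new byte[0]), false, 1000));
            ConfigHelper.setInputSlicePredicate(conf, predicate);

            Job job = new Job(conf, "cassandra-mapreduce-example");
            job.setJarByClass(CassandraJobDriver.class);
            job.setInputFormatClass(ColumnFamilyInputFormat.class);
            // Mapper, reducer, and output configuration omitted for brevity.

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }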
To achieve this data locality, Cassandra nodes must also be part of a Hadoop cluster. The
namenode and jobtracker can reside on a server outside your Cassandra cluster, but each Cassandra
node must join that cluster by running a tasktracker process. Then, when a MapReduce job is
initiated, the jobtracker can query Cassandra for the location of the data as it splits up the
map and reduce tasks.
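In deployment terms, joining the Hadoop cluster mostly comes down to installing Hadoop on each Cassandra node and pointing its tasktracker at the external jobtracker. A sketch of the relevant mapred-site.xml entry for a Hadoop 1.x-style cluster is shown below; the hostname and port are placeholders for wherever your jobtracker runs.

    <!-- mapred-site.xml on each Cassandra node running a tasktracker -->
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>jobtracker.example.com:8021</value>
      </property>
    </configuration>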
Figure 12-1 shows a four-node Cassandra cluster with a tasktracker process running on each
Cassandra node. At least one node in the cluster also needs to run the datanode process: there is a
light dependency on HDFS for small amounts of data (the distributed cache), and a single datanode
should suffice for that. External to the cluster is the server running the Hadoop namenode and
jobtracker.
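The tasktracker and datanode processes on the Cassandra nodes also need to know where that external namenode lives. In a Hadoop 1.x-style setup this would be configured along the following lines; again, the hostname and port are placeholders.

    <!-- core-site.xml on each node that participates in the Hadoop cluster -->
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://namenode.example.com:8020</value>
      </property>
    </configuration>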