Jobtracker
The master process responsible for scheduling MapReduce jobs. The jobtracker accepts new jobs, breaks
them into map and reduce tasks, assigns those tasks to tasktrackers in the cluster, and tracks the
tasks through to job completion. In smaller clusters, it often runs on the same server as the
namenode.
Tasktracker
The worker process responsible for running the map or reduce tasks assigned to it by the jobtracker.
In a typical Hadoop deployment, tasktrackers run on the same servers as datanodes.
Like Cassandra, Hadoop is a distributed system. The MapReduce jobtracker spreads tasks across
the cluster, preferably close to the data those tasks need. When the jobtracker initiates tasks, it
looks to HDFS for information about where that data is stored. Similarly, Cassandra's built-in
Hadoop integration provides the jobtracker with data locality information so that tasks can be
scheduled close to the data.
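To make this concrete, the following is a minimal job-driver sketch showing where that locality information comes from: the ColumnFamilyInputFormat shipped with Cassandra's Hadoop integration turns each node's token ranges into input splits and reports the owning node as each split's location, and ConfigHelper tells it which cluster and column family to read. The class and method names follow the Cassandra Hadoop API of this era and vary slightly between releases; the hostnames, port, keyspace, and column family names are placeholders.

    import java.nio.ByteBuffer;

    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.cassandra.thrift.SlicePredicate;
    import org.apache.cassandra.thrift.SliceRange;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class CassandraJobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Submit the job to the jobtracker running outside the
            // Cassandra cluster (placeholder host and port).
            conf.set("mapred.job.tracker", "jobtracker.example.com:8021");

            // Tell the input format how to reach Cassandra and what to read.
            // ColumnFamilyInputFormat builds one input split per token range
            // and reports the owning node as the split's location, which is
            // what gives the jobtracker its data locality information.
            ConfigHelper.setInputInitialAddress(conf, "cassandra1.example.com");
            ConfigHelper.setInputRpcPort(conf, "9160");
            ConfigHelper.setInputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
            ConfigHelper.setInputColumnFamily(conf, "MyKeyspace", "MyColumnFamily");

            // Read every column of each row (an empty slice range means
            // "from the first column to the last").
            SlicePredicate predicate = new SlicePredicate().setSlice_range(
                    new SliceRange(ByteBuffer.wrap(new byte[0]),
                                   ByteBuffer.wrap(new byte[0]), false, 1000));
            ConfigHelper.setInputSlicePredicate(conf, predicate);

            Job job = new Job(conf, "cassandra-mapreduce-example");
            job.setJarByClass(CassandraJobDriver.class);
            job.setInputFormatClass(ColumnFamilyInputFormat.class);
            // Mapper, reducer, and output configuration omitted for brevity.

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }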
To achieve this data locality, Cassandra nodes must also be part of a Hadoop cluster. The
namenode and jobtracker can reside on a server outside your Cassandra cluster, but each Cassandra
node must join that cluster by running a tasktracker process. Then, when a MapReduce job is
initiated, the jobtracker can query Cassandra for the location of the data as it splits up the
map and reduce tasks.
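In deployment terms, joining the Hadoop cluster mostly comes down to installing Hadoop on each Cassandra node and pointing its tasktracker at the external jobtracker. A sketch of the relevant mapred-site.xml entry for a Hadoop 1.x-style cluster is shown below; the hostname and port are placeholders for wherever your jobtracker runs.

    <!-- mapred-site.xml on each Cassandra node running a tasktracker -->
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>jobtracker.example.com:8021</value>
      </property>
    </configuration>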
Figure 12-1 shows a four-node Cassandra cluster with a tasktracker process running on each
Cassandra node. At least one node in the cluster also needs to run the datanode process: there is a
light dependency on HDFS for small amounts of data (the distributed cache), and a single datanode
should suffice for that. External to the cluster is the server running the Hadoop namenode and
jobtracker.
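The tasktracker and datanode processes on the Cassandra nodes also need to know where that external namenode lives. In a Hadoop 1.x-style setup this would be configured along the following lines; again, the hostname and port are placeholders.

    <!-- core-site.xml on each node that participates in the Hadoop cluster -->
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://namenode.example.com:8020</value>
      </property>
    </configuration>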