DataNodes maintain communication with the master node using a heartbeat mechanism. This enables the master node to replicate the data if one of the slaves dies.
Data never passes through the NameNode; the DataNodes are responsible for streaming the data out. The NameNode and DataNodes work together to provide a giant, scalable virtual filesystem that is oblivious to the underlying hardware and operating system. A data write takes place as follows:
• The client makes a write request for a block of a file to the master, the NameNode
server.
• NameNode returns a list of servers that the block will be copied to (a block is copied to as many DataNodes as the configured replication factor).
• The client makes a ready request to one of the to-be-written-on DataNodes. This node forwards the request to the next node, which forwards it to the next, until all the nodes that the data will be written to acknowledge the client with an OK message.
• On receipt of the OK message, the client starts to stream the data to one of the DataNodes, which internally streams the data to the next replica node, and so on.
• Once the block is written successfully, the slaves notify the master, and the slave connected to the client returns a success message.
The preceding figure shows the data flow when a Hadoop client (CLI or Java) makes a re-
quest to write a block to HDFS.
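To make these steps concrete, here is a minimal client-side sketch using Hadoop's Java FileSystem API; the NameNode address and file path are hypothetical. The block placement, pipelining, and acknowledgements described above all happen behind the output stream returned by fs.create().

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in practice this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        // The client asks the NameNode (master) only for metadata; the stream
        // below pipes the actual bytes directly to the chosen DataNodes, which
        // replicate the block along the write pipeline described above.
        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/data/example.txt"))) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        } finally {
            fs.close();
        }
    }
}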
Hadoop MapReduce
MapReduce (MR) is a very simple concept once you know it. It is algorithm 101: divide and conquer. The job is broken into small independent tasks and distributed across multiple machines, and the results are sorted and merged together to generate the final result. The ability to spread a large computational burden across multiple servers, each bearing only a small computational load, gives a Hadoop programmer effectively limitless computation capability. MR is the processing part of Hadoop; it virtualizes the CPU. The following figure depicts this process.
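As a concrete illustration of this divide and conquer, the following is a minimal word count sketch against Hadoop's Java MapReduce API; the class names are illustrative, not from the text. Each map task processes one split of the input in parallel, and the framework sorts and groups the intermediate pairs by key before they reach the reducer.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Divide: each map task sees one split of the input and emits (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Conquer: the framework groups the pairs by word; the reducer sums the partial counts.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        total.set(sum);
        context.write(word, total);
    }
}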
As an end user, you need to write a Mapper and a Reducer for the task you need to get done. The Hadoop framework performs the heavy lifting of getting data from a source and splitting it into key-value pairs based on what the data source is. It may be a line from a text file, a row from a relational database, or a key-value pair from Cassandra's column family. These key-value pairs (indicated as Input key-val pairs in the following figure) are forwarded to the Mapper that you have provided to Hadoop. The Mapper performs a unit task on each key-value pair; for example, for a word count task, you may want to return each word as the key and 1 as its value.
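A small driver class ties such a Mapper and Reducer into a runnable job; this is a sketch assuming the WordCountMapper and WordCountReducer classes from the earlier example. With the default TextInputFormat, Hadoop hands the Mapper each line of the input files as the value, with the line's byte offset as the key, matching the line-from-a-text-file case just described.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);   // from the earlier sketch
        job.setReducerClass(WordCountReducer.class); // from the earlier sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}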