DataNodes maintain communication with the master node using a heartbeat mechanism. This enables the master node to replicate the data if one of the slaves dies.
Data never passes through the NameNode; the DataNodes are responsible for streaming the data out. The NameNode and DataNodes work together to provide a giant, scalable virtual filesystem that is oblivious to the underlying hardware and operating system. A data write takes place as follows:
• The client makes a write request for a block of a file to the master, the NameNode
server.
• NameNode returns a list of servers that the block will be copied to (a block is copied to as many DataNodes as the configured replication factor).
• The client makes a ready request to one of the to-be-written-on DataNodes. This node forwards the request to the next node, which forwards it to the next, until all the nodes that the data will be written to acknowledge the client with an OK message.
• On receipt of the OK message, the client starts to stream the data to one of the DataNodes, which internally streams the data to the next replica node, and so on.
• Once the block is written successfully, the slaves notify the master, and the slave connected to the client returns a success message.
The preceding figure shows the data flow when a Hadoop client (CLI or Java) makes a re-
quest to write a block to HDFS.
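To make these steps concrete, here is a minimal client-side sketch using Hadoop's Java FileSystem API; the NameNode address and file path are hypothetical. The block placement, pipelining, and acknowledgements described above all happen behind the output stream returned by fs.create().

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in practice this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        // The client asks the NameNode (master) only for metadata; the stream
        // below pipes the actual bytes directly to the chosen DataNodes, which
        // replicate the block along the write pipeline described above.
        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/data/example.txt"))) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        } finally {
            fs.close();
        }
    }
}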
Hadoop MapReduce
MapReduce (MR) is a very simple concept once you know it. It is algorithm 101: divide and conquer. The job is broken into small independent tasks and distributed across multiple machines, and the results are sorted and merged together to generate the final result. The ability to spread a large computational burden across multiple servers, each bearing only a small computational load, gives a Hadoop programmer effectively limitless computation capability. MR is the processing part of Hadoop; it virtualizes the CPU. The following figure depicts this process.
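As a concrete illustration of this divide and conquer, the following is a minimal word count sketch against Hadoop's Java MapReduce API; the class names are illustrative, not from the text. Each map task processes one split of the input in parallel, and the framework sorts and groups the intermediate pairs by key before they reach the reducer.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Divide: each map task sees one split of the input and emits (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Conquer: the framework groups the pairs by word; the reducer sums the partial counts.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        total.set(sum);
        context.write(word, total);
    }
}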
As an end user, you need to write a Mapper and a Reducer for the task you need to get done. The Hadoop framework performs the heavy lifting of getting data from a source and splitting it into key-value pairs based on what the data source is. It may be a line from a text file, a row from a relational database, or a key-value pair from Cassandra's column family. These key-value pairs (indicated as Input key-val pairs in the following figure) are forwarded to the Mapper that you have provided to Hadoop. The Mapper performs a unit task on each key-value pair; for example, for a word count task, you may want to return each word as the key and 1 as its value.
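A small driver class ties such a Mapper and Reducer into a runnable job; this is a sketch assuming the WordCountMapper and WordCountReducer classes from the earlier example. With the default TextInputFormat, Hadoop hands the Mapper each line of the input files as the value, with the line's byte offset as the key, matching the line-from-a-text-file case just described.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);   // from the earlier sketch
        job.setReducerClass(WordCountReducer.class); // from the earlier sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}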