MapReduce - Hadoop: The Definitive Guide


Figure 3-4. A client writing data to HDFS

The client creates the file by calling create() on DistributedFileSystem (step 1 in
Figure 3-4). DistributedFileSystem makes an RPC call to the namenode to create a new
file in the filesystem's namespace, with no blocks associated with it (step 2). The
namenode performs various checks to make sure the file doesn't already exist and that
the client has the right permissions to create the file. If these checks pass, the
namenode makes a record of the new file; otherwise, file creation fails and the client
is thrown an IOException. The DistributedFileSystem returns an FSDataOutputStream for
the client to start writing data to. Just as in the read case, FSDataOutputStream wraps
a DFSOutputStream, which handles communication with the datanodes and namenode.
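
The namenode-side checks in step 2 can be pictured with a small toy model. The sketch below is not Hadoop code: the class ToyNamenode and its methods are hypothetical names invented for illustration, and the permission check is reduced to a simple set of allowed users. It only mirrors the behavior described above, where a new file is recorded with no blocks, or an IOException is raised if the file exists or the client lacks permission.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of the namenode-side checks described above.
// Every name here is hypothetical; this is not Hadoop's implementation.
class ToyNamenode {
    // Namespace: file path -> list of block IDs (empty right after create).
    private final Map<String, List<Long>> namespace = new HashMap<>();
    // Users allowed to create files (a stand-in for real permission checks).
    private final Set<String> writers = new HashSet<>();

    public void allowWriter(String user) {
        writers.add(user);
    }

    // Mirrors step 2: record a new file with no blocks, or fail with IOException.
    public void create(String user, String path) throws IOException {
        if (namespace.containsKey(path)) {
            throw new IOException("File already exists: " + path);
        }
        if (!writers.contains(user)) {
            throw new IOException("Permission denied for user " + user + ": " + path);
        }
        namespace.put(path, new ArrayList<>());
    }

    // Number of blocks recorded for a file, or -1 if the file is unknown.
    public int blockCount(String path) {
        List<Long> blocks = namespace.get(path);
        return blocks == null ? -1 : blocks.size();
    }

    public static void main(String[] args) throws IOException {
        ToyNamenode namenode = new ToyNamenode();
        namenode.allowWriter("alice");
        namenode.create("alice", "/user/alice/data.txt");
        // A freshly created file has no blocks associated with it.
        System.out.println(namenode.blockCount("/user/alice/data.txt"));
        try {
            // Creating the same path again fails the existence check.
            namenode.create("alice", "/user/alice/data.txt");
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this toy model the existence check runs before the permission check. That ordering is an arbitrary choice made for the sketch; the text above does not say in which order the real namenode performs its checks.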

As the client writes data (step 3), the DFSOutputStream splits it into packets, which it
writes to an internal queue called the data queue. The data queue is consumed by the
DataStreamer, which is responsible for asking the namenode to allocate new blocks by
picking a list of suitable datanodes to store the replicas. The list of datanodes forms a
pipeline, and here we'll assume the replication level is three, so there are three nodes in
