MapReduce - Hadoop: The Definitive Guide


Data Flow

Anatomy of a File Read

To get an idea of how data flows between the client interacting with HDFS, the namenode, and the datanodes, consider Figure 3-2, which shows the main sequence of events when reading a file.

Figure 3-2. A client reading data from HDFS

The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem (step 1 in Figure 3-2). DistributedFileSystem calls the namenode, using remote procedure calls (RPCs), to determine the locations of the first few blocks in the file (step 2). For each block, the namenode returns the addresses of the datanodes that have a copy of that block. Furthermore, the datanodes are sorted according to their proximity to the client (according to the topology of the cluster's network; see Network Topology and Hadoop). If the client is itself a datanode (in the case of a MapReduce task, for instance), the client will read from the local datanode if that datanode hosts a copy of the block (see also Figure 2-2 and Short-circuit local reads).
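The proximity ordering described above can be illustrated with a small sketch in plain Java. It does not use the Hadoop API: the class ProximitySketch, its methods, and the /datacenter/rack/node location strings are all hypothetical, and the distance measure (hops from each node up to their closest common ancestor in the topology tree) is an assumption about how "proximity" might be computed, not the namenode's actual code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration only; not part of the Hadoop API.
final class ProximitySketch {

    // Assumed measure: locations are written as /datacenter/rack/node, and the
    // distance is the number of hops from each location up to their closest
    // common ancestor in the topology tree.
    static int distance(String a, String b) {
        String[] pa = a.substring(1).split("/");
        String[] pb = b.substring(1).split("/");
        int common = 0;
        while (common < pa.length && common < pb.length
                && pa[common].equals(pb[common])) {
            common++;
        }
        return (pa.length - common) + (pb.length - common);
    }

    // Returns a copy of the replica locations ordered nearest-first relative
    // to the client. The sort is stable, so ties keep their original order.
    static List<String> sortByProximity(String client, List<String> replicas) {
        List<String> sorted = new ArrayList<>(replicas);
        sorted.sort(Comparator.comparingInt(r -> distance(client, r)));
        return sorted;
    }
}
```

Under these assumptions, a replica on the client's own node has distance 0 and sorts first, which corresponds to the case where a client that is itself a datanode reads the block locally. A replica on the same rack has distance 2, one on another rack in the same data center has distance 4, and one in a different data center has distance 6.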
