Data Flow
Anatomy of a File Read
To get an idea of how data flows between the client interacting with HDFS, the namenode, and the datanodes, consider Figure 3-2, which shows the main sequence of events when reading a file.
Figure 3-2. A client reading data from HDFS
The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem (step 1 in Figure 3-2). DistributedFileSystem calls the namenode, using remote procedure calls (RPCs), to determine the locations of the first few blocks in the file (step 2). For each block, the namenode returns the addresses of the datanodes that have a copy of that block. Furthermore, the datanodes are sorted according to their proximity to the client (according to the topology of the cluster's network; see Network Topology and Hadoop). If the client is itself a datanode (in the case of a MapReduce task, for instance), the client will read from the local datanode if that datanode hosts a copy of the block (see also Figure 2-2 and Short-circuit local reads).
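The proximity ordering described above can be illustrated with a minimal, standalone sketch. It does not use the Hadoop API and is not the namenode's actual implementation. It assumes each node is identified by a hypothetical path of the form /datacenter/rack/node, and that the distance between two nodes is the number of hops from each up to their closest common ancestor. The class name ReplicaOrdering and the node names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/**
 * Illustrative only: orders the replicas of one block by network distance
 * from the client. Locations are hypothetical "/datacenter/rack/node" paths.
 */
final class ReplicaOrdering {

    /** Hops from each location up to their closest common ancestor, summed. */
    static int distance(String a, String b) {
        String[] pa = a.substring(1).split("/");
        String[] pb = b.substring(1).split("/");
        int common = 0;
        // Count how many leading path components the two locations share.
        while (common < pa.length && common < pb.length
                && pa[common].equals(pb[common])) {
            common++;
        }
        return (pa.length - common) + (pb.length - common);
    }

    /** Returns a copy of the replica list, closest to the client first. */
    static List<String> sortByProximity(String client, List<String> replicas) {
        List<String> sorted = new ArrayList<>(replicas);
        sorted.sort(Comparator.comparingInt(r -> distance(client, r)));
        return sorted;
    }

    public static void main(String[] args) {
        // Assumed scenario: the client runs on a node that also holds a replica.
        String client = "/d1/r1/n1";
        List<String> replicas = Arrays.asList("/d2/r3/n7", "/d1/r2/n4", "/d1/r1/n1");
        // Prints [/d1/r1/n1, /d1/r2/n4, /d2/r3/n7]
        System.out.println(sortByProximity(client, replicas));
    }
}
```

Under these assumptions a replica on the client's own node has distance 0 and sorts first, which corresponds to the local-read case in the text. A replica on another rack in the same data center has distance 4, and one in a different data center has distance 6.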