[Figure 6.9 is a chart comparing block sizes (4 KB, 40 KB, 400 KB, 1 MB, 10 MB): the typical OS filesystem block size is 4 KB, while the default HDFS block size is 64 MB.]
Figure 6.9 The size difference between a filesystem block size on a typical desktop
or UNIX operating system (4 KB) and the logical block size within the Apache Hadoop
Distributed File System (64 MB), which is optimized for big data transforms. The
default block size defines a unit of work for the filesystem. The fewer blocks used in
a transfer, the more efficient the transfer process. The downside of using large blocks
is that if data doesn't fill an entire physical block, the empty section of the block can't
be used.
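To make the block-size trade-off concrete, here is a minimal sketch in Python (the 1 GB file size is an assumed example value, not a figure from the text) that counts how many blocks a file occupies under a 4 KB filesystem block versus a 64 MB HDFS block, and how many bytes sit unused in the final, partially filled block.

    import math

    def block_usage(file_size_bytes, block_size_bytes):
        # Blocks needed to hold the file, and bytes left unused in the last block.
        blocks = math.ceil(file_size_bytes / block_size_bytes)
        unused = blocks * block_size_bytes - file_size_bytes
        return blocks, unused

    FILE_SIZE = 1_000_000_000           # hypothetical 1 GB file
    OS_BLOCK = 4 * 1024                 # typical OS filesystem block: 4 KB
    HDFS_BLOCK = 64 * 1024 * 1024       # default HDFS block: 64 MB

    for label, block in (("4 KB OS block", OS_BLOCK), ("64 MB HDFS block", HDFS_BLOCK)):
        blocks, unused = block_usage(FILE_SIZE, block)
        print(f"{label}: {blocks:,} blocks, {unused:,} bytes unused in the last block")

The same file needs hundreds of thousands of 4 KB blocks but only about fifteen 64 MB blocks, which is why large blocks reduce per-block transfer overhead, at the cost of a few megabytes of unused space in the final block.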
These properties make HDFS a highly available input or output destination for gigabyte and larger MapReduce batch jobs.
Now let's take a closer look at how MapReduce jobs work over distributed clusters.
6.7.2 How MapReduce allows efficient transformation of big data problems
In previous chapters, we looked at MapReduce and its exceptional horizontal scale-
out properties. MapReduce is a core component in many big data solutions.
Figure 6.10 provides a detailed look at the internal components of a MapReduce job.
[Figure 6.10 shows the data flow of a MapReduce job: input is processed by map tasks into key-value pairs, the shuffle phase routes the pairs to reduce tasks by key, and the reduce tasks produce the output result.]
Figure 6.10 The basics of how the map and reduce functions work together
to gain linear scalability over big data transforms. The map operation takes
input data and creates a uniform set of key-value pairs. In the shuffle phase,
which the MapReduce framework performs automatically, key-value pairs
are distributed to the correct reduce node based on the key. The reduce
operation takes the key-value pairs and returns consolidated values for each
key. It's the job of the MapReduce framework to get the right keys to the
right reduce nodes.
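To see how these three phases fit together, here is a minimal word-count sketch in plain Python. This is not the Hadoop API; the function names and the in-memory shuffle are simplifications chosen to mirror the figure, assuming each map task emits a (word, 1) pair per word and each reduce task sums the counts for one key.

    from collections import defaultdict

    def map_phase(document):
        # Map: turn raw input into a uniform set of key-value pairs.
        for word in document.split():
            yield (word.lower(), 1)

    def shuffle_phase(pairs):
        # Shuffle: group pairs so each reduce call sees every value for one key.
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        return grouped

    def reduce_phase(key, values):
        # Reduce: consolidate the values for a single key.
        return key, sum(values)

    documents = ["big data needs big tools", "map and reduce scale over big data"]

    # Run the map tasks (one per input document in this toy example).
    mapped = [pair for doc in documents for pair in map_phase(doc)]

    # Shuffle the pairs by key, then run reduce once per key.
    result = dict(reduce_phase(k, v) for k, v in shuffle_phase(mapped).items())
    print(result)   # {'big': 3, 'data': 2, 'needs': 1, ...}

On a real cluster the shuffle moves key-value pairs across the network so that all pairs for a given key land on the same reduce node; because the framework handles that routing, adding more map and reduce nodes scales the job without changes to the map or reduce code.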