[Figure 6.9 is a chart comparing block sizes (4 KB, 40 KB, 400 KB, 1 MB, 10 MB): the typical OS filesystem block size is 4 KB, while the default HDFS block size is 64 MB.]
Figure 6.9 The size difference between a filesystem block size on a typical desktop
or UNIX operating system (4 KB) and the logical block size within the Apache Hadoop
Distributed File System (64 MB), which is optimized for big data transforms. The
default block size defines a unit of work for the filesystem. The fewer blocks used in
a transfer, the more efficient the transfer process. The downside of using large blocks
is that if data doesn't fill an entire physical block, the empty section of the block can't
be used.
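To make the block-size trade-off concrete, here is a minimal sketch in Python (the 1 GB file size is an assumed example value, not a figure from the text) that counts how many blocks a file occupies under a 4 KB filesystem block versus a 64 MB HDFS block, and how many bytes sit unused in the final, partially filled block.

    import math

    def block_usage(file_size_bytes, block_size_bytes):
        # Blocks needed to hold the file, and bytes left unused in the last block.
        blocks = math.ceil(file_size_bytes / block_size_bytes)
        unused = blocks * block_size_bytes - file_size_bytes
        return blocks, unused

    FILE_SIZE = 1_000_000_000           # hypothetical 1 GB file
    OS_BLOCK = 4 * 1024                 # typical OS filesystem block: 4 KB
    HDFS_BLOCK = 64 * 1024 * 1024       # default HDFS block: 64 MB

    for label, block in (("4 KB OS block", OS_BLOCK), ("64 MB HDFS block", HDFS_BLOCK)):
        blocks, unused = block_usage(FILE_SIZE, block)
        print(f"{label}: {blocks:,} blocks, {unused:,} bytes unused in the last block")

The same file needs hundreds of thousands of 4 KB blocks but only about fifteen 64 MB blocks, which is why large blocks reduce per-block transfer overhead, at the cost of a few megabytes of unused space in the final block.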
These properties make HDFS a highly available input or output destination for gigabyte and larger MapReduce batch jobs.
Now let's take a closer look at how MapReduce jobs work over distributed clusters.
6.7.2 How MapReduce allows efficient transformation of big data problems
In previous chapters, we looked at MapReduce and its exceptional horizontal scale-
out properties. MapReduce is a core component in many big data solutions.
Figure 6.10 provides a detailed look at the internal components of a MapReduce job.
[Figure 6.10 shows the data flow of a MapReduce job: input is processed by map tasks into key-value pairs, the shuffle phase routes the pairs to reduce tasks by key, and the reduce tasks produce the output result.]
Figure 6.10 The basics of how the map and reduce functions work together
to gain linear scalability over big data transforms. The map operation takes
input data and creates a uniform set of key-value pairs. In the shuffle phase,
which the MapReduce framework performs automatically, key-value pairs
are distributed to the correct reduce node based on the key. The reduce
operation takes the key-value pairs and returns consolidated values for each
key. It's the job of the MapReduce framework to get the right keys to the
right reduce nodes.
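To see how these three phases fit together, here is a minimal word-count sketch in plain Python. This is not the Hadoop API; the function names and the in-memory shuffle are simplifications chosen to mirror the figure, assuming each map task emits a (word, 1) pair per word and each reduce task sums the counts for one key.

    from collections import defaultdict

    def map_phase(document):
        # Map: turn raw input into a uniform set of key-value pairs.
        for word in document.split():
            yield (word.lower(), 1)

    def shuffle_phase(pairs):
        # Shuffle: group pairs so each reduce call sees every value for one key.
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        return grouped

    def reduce_phase(key, values):
        # Reduce: consolidate the values for a single key.
        return key, sum(values)

    documents = ["big data needs big tools", "map and reduce scale over big data"]

    # Run the map tasks (one per input document in this toy example).
    mapped = [pair for doc in documents for pair in map_phase(doc)]

    # Shuffle the pairs by key, then run reduce once per key.
    result = dict(reduce_phase(k, v) for k, v in shuffle_phase(mapped).items())
    print(result)   # {'big': 3, 'data': 2, 'needs': 1, ...}

On a real cluster the shuffle moves key-value pairs across the network so that all pairs for a given key land on the same reduce node; because the framework handles that routing, adding more map and reduce nodes scales the job without changes to the map or reduce code.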