minimumSize < blockSize < maximumSize
so the split size is blockSize. Various settings for these parameters and how they affect
the final split size are illustrated in Table 8-6.
Table 8-6. Examples of how to control the split size
| Minimum split size | Maximum split size | Block size | Split size | Comment |
|---|---|---|---|---|
| 1 (default) | Long.MAX_VALUE (default) | 128 MB (default) | 128 MB | By default, the split size is the same as the default block size. |
| 1 (default) | Long.MAX_VALUE (default) | 256 MB | 256 MB | The most natural way to increase the split size is to have larger blocks in HDFS, either by setting dfs.blocksize or by configuring this on a per-file basis at file construction time. |
| 256 MB | Long.MAX_VALUE (default) | 128 MB (default) | 256 MB | Making the minimum split size greater than the block size increases the split size, but at the cost of locality. |
| 1 (default) | 64 MB | 128 MB (default) | 64 MB | Making the maximum split size less than the block size decreases the split size. |
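The rule the table rows follow is splitSize = max(minimumSize, min(maximumSize, blockSize)), which is the computation FileInputFormat performs in its computeSplitSize method. A standalone sketch (the SplitSize wrapper class is only for illustration):

```java
// Sketch of FileInputFormat's split-size rule:
// splitSize = max(minimumSize, min(maximumSize, blockSize))
public class SplitSize {
    static final long MB = 1024 * 1024;

    static long computeSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        // The four rows of Table 8-6, printed in MB:
        System.out.println(computeSplitSize(1, Long.MAX_VALUE, 128 * MB) / MB); // 128
        System.out.println(computeSplitSize(1, Long.MAX_VALUE, 256 * MB) / MB); // 256
        System.out.println(computeSplitSize(256 * MB, Long.MAX_VALUE, 128 * MB) / MB); // 256
        System.out.println(computeSplitSize(1, 64 * MB, 128 * MB) / MB); // 64
    }
}
```

Because the block size sits between the default minimum (1) and maximum (Long.MAX_VALUE), the middle term wins by default, which is why the first row reproduces the block size exactly.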
Small files and CombineFileInputFormat
Hadoop works better with a small number of large files than a large number of small files.
One reason for this is that FileInputFormat generates splits in such a way that each
split is all or part of a single file. If the file is very small (“small” means significantly
smaller than an HDFS block) and there are a lot of them, each map task will process very
little input, and there will be a lot of them (one per file), each of which imposes extra
bookkeeping overhead. Compare a 1 GB file broken into eight 128 MB blocks with
10,000 or so 100 KB files. The 10,000 files use one map each, and the job time can be
tens or hundreds of times slower than the equivalent one with a single input file and eight
map tasks.
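The arithmetic behind that comparison can be made concrete with a back-of-the-envelope sketch (the class and method names here are illustrative, not part of Hadoop):

```java
// Back-of-the-envelope map task counts for the example above.
public class MapTaskCount {
    // A single large file gets one map task per block (or partial block).
    static long mapsForSingleFile(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1L << 20;
        // A 1 GB file in 128 MB blocks: 8 map tasks.
        System.out.println(mapsForSingleFile(1024 * mb, 128 * mb));
        // 10,000 files of 100 KB each (roughly the same 1 GB in total):
        // FileInputFormat gives one map per file, so 10,000 map tasks,
        // each processing only 100 KB of input.
        System.out.println(10_000);
    }
}
```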
The situation is alleviated somewhat by CombineFileInputFormat, which was designed
to work well with small files. Where FileInputFormat creates a split per file,
CombineFileInputFormat packs many files into each split so that each mapper has
more to process. Crucially, CombineFileInputFormat takes node and rack locality
into account when deciding which blocks to place in the same split, so it does not
compromise the speed at which it can process the input in a typical MapReduce job.
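For plain-text input, the concrete subclass CombineTextInputFormat can be dropped into a job. A sketch of the setup, assuming Hadoop on the classpath (the job name and input path here are hypothetical):

```java
// Sketch: packing many small text files into each split.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class SmallFilesJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "small files"); // hypothetical name
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Optionally cap how much data is packed into one split (here 128 MB).
        CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024);
        CombineTextInputFormat.addInputPath(job, new Path("input")); // hypothetical path
        // ... set mapper, output types, and output path, then job.waitForCompletion(true)
    }
}
```

Without a maximum split size set, CombineFileInputFormat may pack an entire input directory into a single split, so capping it is the usual way to keep some parallelism.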