minimumSize < blockSize < maximumSize
so the split size is blockSize. Various settings for these parameters and how they affect
the final split size are illustrated in Table 8-6.
Table 8-6. Examples of how to control the split size
| Minimum split size | Maximum split size | Block size | Split size | Comment |
|---|---|---|---|---|
| 1 (default) | Long.MAX_VALUE (default) | 128 MB (default) | 128 MB | By default, the split size is the same as the default block size. |
| 1 (default) | Long.MAX_VALUE (default) | 256 MB | 256 MB | The most natural way to increase the split size is to have larger blocks in HDFS, either by setting dfs.blocksize or by configuring this on a per-file basis at file construction time. |
| 256 MB | Long.MAX_VALUE (default) | 128 MB (default) | 256 MB | Making the minimum split size greater than the block size increases the split size, but at the cost of locality. |
| 1 (default) | 64 MB | 128 MB (default) | 64 MB | Making the maximum split size less than the block size decreases the split size. |
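The rule the table rows follow is splitSize = max(minimumSize, min(maximumSize, blockSize)), which is the computation FileInputFormat performs in its computeSplitSize method. A standalone sketch (the SplitSize wrapper class is only for illustration):

```java
// Sketch of FileInputFormat's split-size rule:
// splitSize = max(minimumSize, min(maximumSize, blockSize))
public class SplitSize {
    static final long MB = 1024 * 1024;

    static long computeSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        // The four rows of Table 8-6, printed in MB:
        System.out.println(computeSplitSize(1, Long.MAX_VALUE, 128 * MB) / MB); // 128
        System.out.println(computeSplitSize(1, Long.MAX_VALUE, 256 * MB) / MB); // 256
        System.out.println(computeSplitSize(256 * MB, Long.MAX_VALUE, 128 * MB) / MB); // 256
        System.out.println(computeSplitSize(1, 64 * MB, 128 * MB) / MB); // 64
    }
}
```

Because the block size sits between the default minimum (1) and maximum (Long.MAX_VALUE), the middle term wins by default, which is why the first row reproduces the block size exactly.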
Small files and CombineFileInputFormat
Hadoop works better with a small number of large files than a large number of small files.
One reason for this is that FileInputFormat generates splits in such a way that each
split is all or part of a single file. If the file is very small (“small” means significantly
smaller than an HDFS block) and there are a lot of them, each map task will process very
little input, and there will be a lot of them (one per file), each of which imposes extra
bookkeeping overhead. Compare a 1 GB file broken into eight 128 MB blocks with
10,000 or so 100 KB files. The 10,000 files use one map each, and the job time can be
tens or hundreds of times slower than the equivalent one with a single input file and eight
map tasks.
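The arithmetic behind that comparison can be made concrete with a back-of-the-envelope sketch (the class and method names here are illustrative, not part of Hadoop):

```java
// Back-of-the-envelope map task counts for the example above.
public class MapTaskCount {
    // A single large file gets one map task per block (or partial block).
    static long mapsForSingleFile(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1L << 20;
        // A 1 GB file in 128 MB blocks: 8 map tasks.
        System.out.println(mapsForSingleFile(1024 * mb, 128 * mb));
        // 10,000 files of 100 KB each (roughly the same 1 GB in total):
        // FileInputFormat gives one map per file, so 10,000 map tasks,
        // each processing only 100 KB of input.
        System.out.println(10_000);
    }
}
```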
The situation is alleviated somewhat by CombineFileInputFormat, which was designed
to work well with small files. Where FileInputFormat creates a split per file,
CombineFileInputFormat packs many files into each split so that each mapper has
more to process. Crucially, CombineFileInputFormat takes node and rack locality
into account when deciding which blocks to place in the same split, so it does not
compromise the speed at which it can process the input in a typical MapReduce job.
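For plain-text input, the concrete subclass CombineTextInputFormat can be dropped into a job. A sketch of the setup, assuming Hadoop on the classpath (the job name and input path here are hypothetical):

```java
// Sketch: packing many small text files into each split.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class SmallFilesJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "small files"); // hypothetical name
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Optionally cap how much data is packed into one split (here 128 MB).
        CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024);
        CombineTextInputFormat.addInputPath(job, new Path("input")); // hypothetical path
        // ... set mapper, output types, and output path, then job.waitForCompletion(true)
    }
}
```

Without a maximum split size set, CombineFileInputFormat may pack an entire input directory into a single split, so capping it is the usual way to keep some parallelism.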