Of course, if possible, it is still a good idea to avoid the many small files case, because MapReduce works best when it can operate at the transfer rate of the disks in the cluster, and processing many small files increases the number of seeks that are needed to run a job. Also, storing large numbers of small files in HDFS is wasteful of the namenode's memory. One technique for avoiding the many small files case is to merge small files into larger files by using a sequence file, as in Example 8-4; with this approach, the keys can act as filenames (or a constant such as NullWritable, if not needed) and the values as file contents. But if you already have a large number of small files in HDFS, then CombineFileInputFormat is worth trying.
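To get a feel for why many small files hurt, here is a rough back-of-envelope calculation. The figures are assumptions for illustration, not measurements from any particular cluster: a 10 ms average seek time and a 100 MB/s sustained transfer rate. Reading 1 GB as a single file is dominated by transfer time, whereas reading the same 1 GB as 10,000 files of roughly 100 KB each pays one seek per file:

```java
public class SeekOverhead {
    public static void main(String[] args) {
        double seekSeconds = 0.010;   // assumed average seek time: 10 ms
        double transferMBps = 100.0;  // assumed sustained transfer rate: 100 MB/s
        double totalMB = 1024.0;      // 1 GB of input in total

        // One large file: a single seek, then one long sequential read.
        double oneFile = seekSeconds + totalMB / transferMBps;

        // 10,000 small files: one seek per file plus the same total transfer.
        int numSmallFiles = 10_000;
        double manyFiles = numSmallFiles * seekSeconds + totalMB / transferMBps;

        System.out.printf("single file:        %.2f s%n", oneFile);
        System.out.printf("10,000 small files: %.2f s%n", manyFiles);
    }
}
```

With these assumed numbers the seeks alone add about 100 seconds, making the small-files read roughly ten times slower, before counting any per-task scheduling overhead.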
NOTE
CombineFileInputFormat isn't just good for small files. It can bring benefits when processing large files, too, since it will generate one split per node, which may be made up of multiple blocks. Essentially, CombineFileInputFormat decouples the amount of data that a mapper consumes from the block size of the files in HDFS.
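As a sketch of how a job might be configured to use the combining behavior with the new MapReduce API: the driver class name, the paths, and the 128 MB split cap below are all placeholder assumptions, and CombineTextInputFormat (the concrete text-oriented subclass of CombineFileInputFormat) is only available in more recent Hadoop releases.

```java
// Sketch: configure a job so each mapper reads multiple small files
// (or several blocks of a large one) in a single combined split.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(CombineSmallFilesDriver.class);
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at 128 MB (an assumed figure).
        CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```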
Preventing splitting
Some applications don't want files to be split, so that a single mapper can process each input file in its entirety. For example, a simple way to check if all the records in a file are sorted is to go through the records in order, checking whether each record is not less than the preceding one. Implemented as a map task, this algorithm will work only if one map processes the whole file. [56]
There are a couple of ways to ensure that an existing file is not split. The first (quick-and-dirty) way is to increase the minimum split size to be larger than the largest file in your system. Setting it to its maximum value, Long.MAX_VALUE, has this effect. The second is to subclass the concrete subclass of FileInputFormat that you want to use, to override the isSplitable() method [57] to return false. For example, here's a nonsplittable TextInputFormat:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}
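For reference, the first (quick-and-dirty) approach needs no code change at all: the minimum split size can be set on the command line. This is a sketch in which the jar name, driver class, and paths are placeholders; it assumes the driver uses ToolRunner so that -D properties are picked up, and the property name shown is the one used by the newer MapReduce API (older releases call it mapred.min.split.size).

```shell
# Force each file into a single split by making the minimum split size
# larger than any file in the system (Long.MAX_VALUE).
hadoop jar job.jar MyDriver \
    -D mapreduce.input.fileinputformat.split.minsize=9223372036854775807 \
    input/path output/path
```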