Of course, if possible, it is still a good idea to avoid the many small files case, because MapReduce works best when it can operate at the transfer rate of the disks in the cluster, and processing many small files increases the number of seeks that are needed to run a job. Also, storing large numbers of small files in HDFS is wasteful of the namenode's memory. One technique for avoiding the many small files case is to merge small files into larger files by using a sequence file, as in Example 8-4; with this approach, the keys can act as filenames (or a constant such as NullWritable, if not needed) and the values as file contents. But if you already have a large number of small files in HDFS, then CombineFileInputFormat is worth trying.
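To get a feel for why many small files hurt, here is a rough back-of-envelope calculation. The figures are assumptions for illustration, not measurements from any particular cluster: a 10 ms average seek time and a 100 MB/s sustained transfer rate. Reading 1 GB as a single file is dominated by transfer time, whereas reading the same 1 GB as 10,000 files of roughly 100 KB each pays one seek per file:

```java
public class SeekOverhead {
    public static void main(String[] args) {
        double seekSeconds = 0.010;   // assumed average seek time: 10 ms
        double transferMBps = 100.0;  // assumed sustained transfer rate: 100 MB/s
        double totalMB = 1024.0;      // 1 GB of input in total

        // One large file: a single seek, then one long sequential read.
        double oneFile = seekSeconds + totalMB / transferMBps;

        // 10,000 small files: one seek per file plus the same total transfer.
        int numSmallFiles = 10_000;
        double manyFiles = numSmallFiles * seekSeconds + totalMB / transferMBps;

        System.out.printf("single file:        %.2f s%n", oneFile);
        System.out.printf("10,000 small files: %.2f s%n", manyFiles);
    }
}
```

With these assumed numbers the seeks alone add about 100 seconds, making the small-files read roughly ten times slower, before counting any per-task scheduling overhead.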
NOTE
CombineFileInputFormat isn't just good for small files. It can bring benefits when processing large files, too, since it will generate one split per node, which may be made up of multiple blocks. Essentially, CombineFileInputFormat decouples the amount of data that a mapper consumes from the block size of the files in HDFS.
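As a sketch of how a job might be configured to use the combining behavior with the new MapReduce API: the driver class name, the paths, and the 128 MB split cap below are all placeholder assumptions, and CombineTextInputFormat (the concrete text-oriented subclass of CombineFileInputFormat) is only available in more recent Hadoop releases.

```java
// Sketch: configure a job so each mapper reads multiple small files
// (or several blocks of a large one) in a single combined split.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(CombineSmallFilesDriver.class);
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at 128 MB (an assumed figure).
        CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```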
Preventing splitting
Some applications don't want files to be split, so that a single mapper can process each input file in its entirety. For example, a simple way to check if all the records in a file are sorted is to go through the records in order, checking whether each record is not less than the preceding one. Implemented as a map task, this algorithm will work only if one map processes the whole file. [56]
There are a couple of ways to ensure that an existing file is not split. The first (quick-and-dirty) way is to increase the minimum split size to be larger than the largest file in your system. Setting it to its maximum value, Long.MAX_VALUE, has this effect. The second is to subclass the concrete subclass of FileInputFormat that you want to use, to override the isSplitable() method [57] to return false. For example, here's a nonsplittable TextInputFormat:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}
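For reference, the first (quick-and-dirty) approach needs no code change at all: the minimum split size can be set on the command line. This is a sketch in which the jar name, driver class, and paths are placeholders; it assumes the driver uses ToolRunner so that -D properties are picked up, and the property name shown is the one used by the newer MapReduce API (older releases call it mapred.min.split.size).

```shell
# Force each file into a single split by making the minimum split size
# larger than any file in the system (Long.MAX_VALUE).
hadoop jar job.jar MyDriver \
    -D mapreduce.input.fileinputformat.split.minsize=9223372036854775807 \
    input/path output/path
```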