Table 8-5. Properties for controlling split size

| Property name | Type | Default value | Description |
|---|---|---|---|
| mapreduce.input.fileinputformat.split.minsize | int | 1 | The smallest valid size in bytes for a file split |
| mapreduce.input.fileinputformat.split.maxsize[a] | long | Long.MAX_VALUE (i.e., 9223372036854775807) | The largest valid size in bytes for a file split |
| dfs.blocksize | long | 128 MB (i.e., 134217728) | The size of a block in HDFS in bytes |

[a] This property is not present in the old MapReduce API (with the exception of CombineFileInputFormat). Instead, it is calculated indirectly as the size of the total input for the job divided by the guide number of map tasks specified by mapreduce.job.maps (or the setNumMapTasks() method on JobConf). Because the number of map tasks defaults to 1, this makes the maximum split size the size of the input.
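In the new MapReduce API, the first two properties can also be set through the static helper methods FileInputFormat.setMinInputSplitSize() and FileInputFormat.setMaxInputSplitSize(). The following is a minimal sketch of a job driver, assuming a hypothetical input path passed as the first command-line argument and illustrative 64 MB/256 MB bounds:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-size-demo");

        // Writes mapreduce.input.fileinputformat.split.minsize
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB (illustrative)
        // Writes mapreduce.input.fileinputformat.split.maxsize
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);  // 256 MB (illustrative)

        FileInputFormat.addInputPath(job, new Path(args[0]));
        // ... set mapper, reducer, output format, and output path, then submit
    }
}
```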
The minimum split size is usually 1 byte, although some formats have a lower bound on the split size. (For example, sequence files insert sync entries every so often in the stream, so the minimum split size has to be large enough to ensure that every split has a sync point to allow the reader to resynchronize with a record boundary. See Reading a SequenceFile.)
Applications may impose a minimum split size. By setting this to a value larger than the block size, they can force splits to be larger than a block. There is no good reason for doing this when using HDFS, because doing so will increase the number of blocks that are not local to a map task.
The maximum split size defaults to the maximum value that can be represented by a Java
long type. It has an effect only when it is less than the block size, forcing splits to be
smaller than a block.
The split size is calculated by the following formula (see the computeSplitSize() method in FileInputFormat):

max(minimumSize, min(maximumSize, blockSize))

and by default:

minimumSize < blockSize < maximumSize

so the split size is blockSize.
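To make the arithmetic concrete, here is a minimal sketch that evaluates the formula with the default values from Table 8-5 (this mirrors the shape of computeSplitSize() but is not the actual Hadoop source):

```java
public class ComputeSplitSizeDemo {
    // max(minimumSize, min(maximumSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long minSize = 1L;                    // split.minsize default
        long maxSize = Long.MAX_VALUE;        // split.maxsize default
        long blockSize = 128L * 1024 * 1024;  // dfs.blocksize default (134217728)

        // min(maxSize, blockSize) = blockSize, and max(minSize, blockSize) = blockSize,
        // so with the defaults each split matches an HDFS block.
        System.out.println(computeSplitSize(blockSize, minSize, maxSize)); // 134217728
    }
}
```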