the directories specified by the mapreduce.cluster.local.dir property, in a job-specific subdirectory.
Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that it will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort. Running the combiner function makes for a more compact map output, so there is less data to write to local disk and to transfer to the reducer.
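For example, a job whose reduce function is commutative and associative (a sum or a maximum, say) can reuse it as the combiner. The sketch below assumes a word-count-style job; the driver class name and the omitted mapper are illustrative, but Job.setCombinerClass and IntSumReducer are part of the standard MapReduce API.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

    public class CombinerDriver {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count with combiner");
        job.setJarByClass(CombinerDriver.class);
        // Reusing the reducer as a combiner is safe here because summing
        // partial counts in the map task gives the same result as summing
        // the raw counts in the reducer.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        // Mapper, key/value types, and input/output paths omitted for brevity.
      }
    }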
Each time the memory buffer reaches the spill threshold, a new spill file is created, so after the map task has written its last output record, there could be several spill files. Before the task is finished, the spill files are merged into a single partitioned and sorted output file. The configuration property mapreduce.task.io.sort.factor controls the maximum number of streams to merge at once; the default is 10.
If there are at least three spill files (a minimum set by the mapreduce.map.combine.minspills property), the combiner is run again before the output file is written. Recall that combiners may be run repeatedly over the input without affecting the final result. If there are only one or two spills, the potential reduction in map output size is not worth the overhead of invoking the combiner, so it is not run again for this map output.
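Both merge-related settings can be tuned per job through the configuration API. In the sketch below the property names are the standard ones described above, but the values are purely illustrative:

    import org.apache.hadoop.conf.Configuration;

    public class SpillMergeTuning {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Merge up to 20 spill streams at once instead of the default 10.
        conf.setInt("mapreduce.task.io.sort.factor", 20);
        // Only rerun the combiner during the final merge if at least this
        // many spill files were produced (the default is 3).
        conf.setInt("mapreduce.map.combine.minspills", 3);
      }
    }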
It is often a good idea to compress the map output as it is written to disk, because doing so makes it faster to write to disk, saves disk space, and reduces the amount of data to transfer to the reducer. By default, the output is not compressed, but it is easy to enable compression by setting mapreduce.map.output.compress to true. The compression library to use is specified by mapreduce.map.output.compress.codec; see Compression for more on compression formats.
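As a rough sketch, the two properties can be set like this; the choice of Snappy is an assumption and presumes the native library is available on the cluster nodes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;

    public class MapOutputCompression {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Compress intermediate map output before it is spilled to local
        // disk and shipped to the reducers over the network.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
            SnappyCodec.class, CompressionCodec.class);
      }
    }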
The output file's partitions are made available to the reducers over HTTP. The maximum number of worker threads used to serve the file partitions is controlled by the mapreduce.shuffle.max.threads property; this setting is per node manager, not per map task. The default of 0 sets the maximum number of threads to twice the number of processors on the machine.
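The following fragment is not the shuffle handler's actual code, only a sketch of how that default is interpreted:

    import org.apache.hadoop.conf.Configuration;

    public class ShuffleThreads {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        int configured = conf.getInt("mapreduce.shuffle.max.threads", 0);
        // A value of 0 means "use twice the number of available processors
        // on the node manager host".
        int workers = configured > 0
            ? configured
            : 2 * Runtime.getRuntime().availableProcessors();
        System.out.println("shuffle worker threads: " + workers);
      }
    }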
The Reduce Side
Let's turn now to the reduce part of the process. The map output file is sitting on the local disk of the machine that ran the map task (note that although map outputs always get written to local disk, reduce outputs may not be), but now it is needed by the machine that is about to run the reduce task for the partition. Moreover, the reduce task needs the map output for its particular partition from several map tasks across the cluster. The map tasks may finish at different times, so the reduce task starts copying their outputs as soon as each completes.