Minyi Guo, Qianni Deng, and Song Guo. The paper is catalogued online at
http://portal.acm.org/citation.cfm?id=1901325.
ADDITIONAL MAPREDUCE TUNING
A number of configuration parameters that affect MapReduce can be tuned to achieve better
performance.
Communication Overheads
When data sets are very large, the algorithmic complexity of MapReduce is the least of the
concerns. The focus is often on being able to process the large data set in the first place. However,
bear in mind that some of the communication overhead and the associated algorithmic complexity
can be avoided by simply getting rid of the reduce task where possible. In such cases, map does
everything and no map output has to be shuffled across the network to reducers. Where eliminating
the reduce task is not an option, launching the reduce tasks before all map tasks have completed
can improve performance, because reducers can begin fetching map output while the remaining
maps are still running.
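Both options above come down to job configuration. The following is a minimal sketch in plain Java that only assembles the relevant settings and prints them as -D options of the kind Hadoop's generic options parser accepts. The property names mapred.reduce.tasks and mapred.reduce.slowstart.completed.maps are assumptions based on classic (pre-YARN) Hadoop naming and do not come from the text above, and the class ReduceTuningSketch is hypothetical; check the names against the defaults file of the Hadoop release you run.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class ReduceTuningSketch {
    // Assumed classic (pre-YARN) property names; verify against your Hadoop release.
    static final String REDUCE_TASKS = "mapred.reduce.tasks";
    static final String SLOWSTART = "mapred.reduce.slowstart.completed.maps";

    /** Settings for a map-only job: zero reduce tasks, so map output is the job output. */
    static Map<String, String> mapOnly() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put(REDUCE_TASKS, "0");
        return props;
    }

    /** Settings that let reduce tasks launch once the given fraction of maps has finished. */
    static Map<String, String> earlyReduceStart(double fractionOfMapsDone) {
        // Written this way so that NaN is rejected as well as out-of-range values.
        if (!(fractionOfMapsDone >= 0.0 && fractionOfMapsDone <= 1.0)) {
            throw new IllegalArgumentException("fraction must be in [0, 1]");
        }
        Map<String, String> props = new LinkedHashMap<>();
        props.put(SLOWSTART, String.valueOf(fractionOfMapsDone));
        return props;
    }

    /** Renders the settings as space-separated "-D key=value" command-line options. */
    static String toGenericOptions(Map<String, String> props) {
        StringJoiner joiner = new StringJoiner(" ");
        for (Map.Entry<String, String> entry : props.entrySet()) {
            joiner.add("-D " + entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        System.out.println(toGenericOptions(mapOnly()));
        System.out.println(toGenericOptions(earlyReduceStart(0.5)));
    }
}
```

A low fraction starts reducers early so that copying overlaps with the map phase, at the cost of reduce slots being held while maps are still running. The right value depends on the cluster and the job.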
Compression
Compressing data as it is transmitted between nodes and between map and reduce tasks can improve
performance dramatically. Essentially, the communication overhead is reduced because fewer bytes
cross the network, so avoidable bandwidth usage is eliminated. For large clusters and large jobs,
compression can lead to substantial benefits.
Some data sets aren't easily compressible or do not compress enough to provide
substantial benefits.
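One rough way to judge whether a data set is worth compressing is to compress a sample and look at the size ratio. The sketch below uses the JDK's built-in DEFLATE implementation as a stand-in; it is not the codec Hadoop would use, and the class name CompressibilityCheck and both sample inputs are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.Deflater;

final class CompressibilityCheck {
    /** Returns compressed size divided by original size (lower means more compressible). */
    static double ratio(byte[] input) {
        if (input.length == 0) {
            return 1.0; // nothing to gain on empty input
        }
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[8192];
        long compressedBytes = 0;
        while (!deflater.finished()) {
            compressedBytes += deflater.deflate(buffer);
        }
        deflater.end(); // release native resources
        return (double) compressedBytes / input.length;
    }

    /** Hypothetical sample: highly repetitive, log-like text. */
    static byte[] repetitiveSample() {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 5000; i++) {
            text.append("2011-01-01 INFO request served status=200\n");
        }
        return text.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Hypothetical sample: pseudo-random bytes, similar to already-compressed data. */
    static byte[] randomSample() {
        byte[] bytes = new byte[200_000];
        new Random(42).nextBytes(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        System.out.printf("repetitive text ratio: %.3f%n", ratio(repetitiveSample()));
        System.out.printf("random bytes ratio:    %.3f%n", ratio(randomSample()));
    }
}
```

The repetitive sample shrinks to a small fraction of its size, while the random sample stays at roughly its original size. Data that behaves like the second sample only costs CPU time when compression is switched on.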
Turning compression on is as simple as setting a single configuration parameter to true. This single
parameter is:
mapred.compress.map.output
The compression codec can also be configured. Use mapred.map.output.compression.codec to
set the codec.
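As an illustration, the two parameters could be written out as entries for a Hadoop configuration file. This is a sketch in plain Java that only generates the XML text; the class MapOutputCompressionSketch is hypothetical, and org.apache.hadoop.io.compress.GzipCodec is used purely as an example codec, not as a recommendation from the text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class MapOutputCompressionSketch {
    /** Builds the two map-output compression settings named in the text. */
    static Map<String, String> settings(String codecClassName) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("mapred.compress.map.output", "true");
        props.put("mapred.map.output.compression.codec", codecClassName);
        return props;
    }

    /**
     * Renders settings in the <configuration>/<property> layout used by Hadoop
     * configuration files. Keys and values are assumed to contain no XML
     * special characters, so no escaping is performed.
     */
    static String toConfigurationXml(Map<String, String> props) {
        StringBuilder xml = new StringBuilder("<configuration>\n");
        for (Map.Entry<String, String> entry : props.entrySet()) {
            xml.append("  <property>\n")
               .append("    <name>").append(entry.getKey()).append("</name>\n")
               .append("    <value>").append(entry.getValue()).append("</value>\n")
               .append("  </property>\n");
        }
        return xml.append("</configuration>\n").toString();
    }

    public static void main(String[] args) {
        // Example codec only; substitute whichever codec is installed on the cluster.
        String codec = "org.apache.hadoop.io.compress.GzipCodec";
        System.out.print(toConfigurationXml(settings(codec)));
    }
}
```

The same two properties can also be set per job rather than cluster-wide, which makes it easier to leave compression off for jobs whose data does not compress well.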
LZO is a compression algorithm that is suitable for real-time compression.
It favors speed over compression ratio. Read more about LZO at
www.oberhumer.com/opensource/lzo/.
A further improvement could be to use splittable LZO. Most MapReduce tasks are I/O bound. If
files on HDFS are compressed into a format that can be split and consumed by the MapReduce