Benchmarking MapReduce with TeraSort
Hadoop comes with a MapReduce program called TeraSort that does a total sort of its
input. [76] It is very useful for benchmarking HDFS and MapReduce together, as the full
input dataset is transferred through the shuffle. The three steps are: generate some random
data, perform the sort, then validate the results.
First, we generate some random data using teragen (found in the examples JAR file,
not the tests one). It runs a map-only job that generates a specified number of rows of
binary data. Each row is 100 bytes long, so one terabyte of data is 10 billion rows. To
generate that much using 1,000 maps, run the following:
% hadoop jar \
$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
teragen -Dmapreduce.job.maps=1000 10000000000 random-data
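The row count passed to teragen follows directly from the target size and the 100-byte row length. As a minimal sketch of that arithmetic (the variable names are ours, not part of teragen, and one terabyte is taken here as 10^12 bytes):

```shell
# Rows needed for a target dataset size, at 100 bytes per teragen row.
# Assumption: "one terabyte" means 10^12 bytes; adjust target_bytes as needed.
target_bytes=1000000000000
row_bytes=100
rows=$(( target_bytes / row_bytes ))
echo "$rows"
```

The printed value is the row count to give teragen as its first argument.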
Next, run terasort:
% hadoop jar \
$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
terasort random-data sorted-data
The overall execution time of the sort is the metric we are interested in, but it's instructive
to watch the job's progress via the web UI (http://resource-manager-host:8088/),
where you can get a feel for how long each phase of the job takes. Adjusting the
parameters mentioned in Tuning a Job is a useful exercise, too.
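Because the overall execution time is the number we care about, it can help to record it alongside the command. The following is one possible wrapper, not something Hadoop provides: time_cmd is a hypothetical helper name, and it measures wall-clock time in whole seconds only.

```shell
# Run a command and report how long it took, in wall-clock seconds.
# time_cmd is a hypothetical helper for illustration, not a Hadoop tool.
time_cmd() {
  _tc_start=$(date +%s)
  _tc_status=0
  "$@" || _tc_status=$?          # run the command, remembering its exit status
  _tc_end=$(date +%s)
  echo "elapsed_seconds=$(( _tc_end - _tc_start ))"
  return "$_tc_status"           # pass the command's exit status through
}
```

To use it, prefix the terasort command with time_cmd. Comparing the reported figure across runs is only meaningful if the input data and cluster load are the same each time.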
As a final sanity check, we validate that the data in sorted-data is, in fact, correctly sorted:
% hadoop jar \
$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
teravalidate sorted-data report
This command runs a short MapReduce job that performs a series of checks on the sorted
data to verify that the sort is accurate. Any errors can be found in the
report/part-r-00000 output file.
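Rather than reading the report by eye, we can count the error records in it. This is a rough sketch under one assumption that should be checked against your own output: that each error record is a line beginning with the word error. The function name count_report_errors is ours.

```shell
# Count lines on standard input that look like error records.
# Assumption: error records begin with "error"; change the pattern if
# the report produced by your Hadoop version is laid out differently.
count_report_errors() {
  grep -c '^error' || true       # grep exits nonzero when the count is 0
}
```

The report can be piped in with hadoop fs -cat report/part-r-00000 | count_report_errors, and a result of 0 suggests that no misordered rows were reported.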
Other benchmarks
There are many more Hadoop benchmarks, but the following are widely used:
TestDFSIO tests the I/O performance of HDFS. It does this by using a MapReduce
job as a convenient way to read or write files in parallel.