Benchmarking MapReduce with TeraSort
Hadoop comes with a MapReduce program called TeraSort that does a total sort of its input.
It is useful as a benchmark because the full input dataset is transferred through the
shuffle. The three steps are: generate some random data, perform the sort, then validate
the results.
First, we generate some random data using teragen (found in the examples JAR file, not the
tests one). It runs a map-only job that generates a specified number of rows of binary
data. Each row is 100 bytes long, so one terabyte of data is 10 billion rows (a row count
of 10t, short for 10 trillion, would produce a petabyte). To generate one terabyte using
1,000 maps, run the following:
% hadoop jar \
  $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  teragen -Dmapreduce.job.maps=1000 10000000000 random-data
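The row count passed to teragen follows directly from the target dataset size. The snippet below is a minimal sketch of that arithmetic, not part of Hadoop: the helper name rows_for_bytes is hypothetical, and it assumes the 100-byte row size stated above and decimal units (one terabyte taken as 10^12 bytes).

```shell
#!/bin/sh
# Bytes per teragen row, as stated in the text.
ROW_BYTES=100

# rows_for_bytes is a hypothetical helper: prints the number of rows
# needed to produce the given number of bytes (rounded down).
rows_for_bytes() {
  echo $(( $1 / ROW_BYTES ))
}

# One terabyte, taken here as 10^12 bytes (decimal units).
ONE_TB=1000000000000
rows_for_bytes "$ONE_TB"   # prints 10000000000
```

The printed value is the row count to hand to teragen for a one-terabyte dataset; scaling the byte total up or down gives the count for other sizes.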
Next, run terasort:
% hadoop jar \
  $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  terasort random-data sorted-data
The overall execution time of the sort is the metric we are interested in, but it's
instructive to watch the job's progress via the web UI
(http://resource-manager-host:8088/), where you can get a feel for how long each phase of
the job takes. Adjusting the parameters mentioned in Tuning a Job is a useful exercise, too.
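Since the overall execution time is the metric of interest, it can help to record it in the same way on every run. The following is a minimal sketch rather than a Hadoop facility: the helper name timed is hypothetical, and it assumes a date command that supports the +%s format (seconds since the epoch), which is common but not universal.

```shell
#!/bin/sh
# timed is a hypothetical helper: runs the given command, then prints
# the elapsed wall-clock time in whole seconds on stderr and returns
# the command's own exit status.
timed() {
  start=$(date +%s)
  "$@"
  status=$?
  end=$(date +%s)
  echo "elapsed_seconds=$(( end - start ))" >&2
  return "$status"
}

# Example usage, wrapping the terasort command from the text:
# timed hadoop jar \
#   $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
#   terasort random-data sorted-data
```

Whole-second resolution is coarse, but it is adequate for a sort that runs for minutes or hours, and it makes runs with different tuning parameters easy to compare.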
As a final sanity check, we validate that the data in sorted-data is, in fact, correctly
sorted:
% hadoop jar \
  $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  teravalidate sorted-data report
This command runs a short MapReduce job that performs a series of checks on the sorted
data to verify that the sort is accurate. Any errors can be found in the
report/part-r-00000 output file.
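Inspecting the report can be scripted. The sketch below is one possible approach, under a stated assumption: the helper name count_report_errors is hypothetical, and it assumes that problem records in the report begin with the word "error", which the text above does not confirm and which should be checked against the output of your Hadoop version.

```shell
#!/bin/sh
# count_report_errors is a hypothetical helper: reads a teravalidate
# report on stdin and prints how many lines start with "error".
# ASSUMPTION: problem records are prefixed with "error"; verify this
# against the report produced by your Hadoop version.
count_report_errors() {
  # grep -c exits non-zero when the count is 0, so mask that status.
  grep -c '^error' || true
}

# Example usage against the report file named in the text:
# hadoop fs -cat report/part-r-00000 | count_report_errors
```

If that assumption holds, a count of zero indicates that the validation found no problems.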
Other benchmarks
There are many more Hadoop benchmarks, but the following are widely used:
▪ TestDFSIO tests the I/O performance of HDFS. It does this by using a MapReduce job as a
  convenient way to read or write files in parallel.