Setting Up a Hadoop Cluster - Hadoop: The Definitive Guide

Benchmarking MapReduce with TeraSort

Hadoop comes with a MapReduce program called TeraSort that does a total sort of its input. [76] It is very useful for benchmarking HDFS and MapReduce together, as the full input dataset is transferred through the shuffle. The three steps are: generate some random data, perform the sort, then validate the results.

First, we generate some random data using teragen (found in the examples JAR file, not the tests one). It runs a map-only job that generates a specified number of rows of binary data. Each row is 100 bytes long, so one terabyte of data is 10 billion rows. To generate it using 1,000 maps, run the following (10b is short for 10 billion):

% hadoop jar \
  $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  teragen -Dmapreduce.job.maps=1000 10b random-data
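The row count passed to teragen follows directly from the target dataset size and the 100-byte row length. As a rough sketch, the shell function below does that division; the name teragen_rows is hypothetical and not part of Hadoop.

```shell
# Hypothetical helper (not part of Hadoop): convert a target dataset size
# in bytes into the number of 100-byte rows that teragen should generate.
ROW_BYTES=100

teragen_rows() {
  # $1 is the target size in bytes. Integer division drops any remainder
  # smaller than one row.
  echo $(( $1 / ROW_BYTES ))
}

# One terabyte (10^12 bytes) at 100 bytes per row.
teragen_rows 1000000000000
```

The printed value is the number of rows to pass as teragen's first argument for a one-terabyte dataset.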

Next, run terasort:

% hadoop jar \
  $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  terasort random-data sorted-data

The overall execution time of the sort is the metric we are interested in, but it's instructive to watch the job's progress via the web UI (http://resource-manager-host:8088/), where you can get a feel for how long each phase of the job takes. Adjusting the parameters mentioned in Tuning a Job is a useful exercise, too.
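Since the overall execution time is the number being compared between runs, it can help to record it the same way each time. The following is a minimal sketch of a timing wrapper; time_job is a hypothetical name, and it assumes a date command that supports the +%s format.

```shell
# Hypothetical helper: run a command, then print its wall-clock duration
# in whole seconds to stderr. Assumes date(1) supports the +%s format.
time_job() {
  start=$(date +%s)
  status=0
  "$@" || status=$?          # run the wrapped command, keep its exit code
  end=$(date +%s)
  echo "elapsed_seconds=$(( end - start ))" >&2
  return "$status"
}

# Usage (needs a configured Hadoop client, so it is left commented out):
# time_job hadoop jar \
#   $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
#   terasort random-data sorted-data
```

The wrapper passes through the exit status of the wrapped command, so a failed job is still visible to a calling script.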

As a final sanity check, we validate that the data in sorted-data is, in fact, correctly sorted:

% hadoop jar \
  $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  teravalidate sorted-data report

This command runs a short MapReduce job that performs a series of checks on the sorted data to verify that the sort is accurate. Any errors can be found in the report/part-r-00000 output file.
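A scripted benchmark run might want to fail automatically when the report contains errors. The sketch below assumes the report has first been copied to the local filesystem, and that each problem is written as a line beginning with the key "error". That record format is an assumption here and should be checked against the Hadoop version in use. The function name report_is_clean is hypothetical.

```shell
# Hypothetical helper: succeed only if a local copy of the teravalidate
# report contains no error records.
# ASSUMPTION: error records are lines that start with the key "error".
report_is_clean() {
  report_file=$1
  if grep -q '^error' "$report_file"; then
    echo "teravalidate reported errors in $report_file" >&2
    return 1
  fi
  return 0
}

# Usage (needs a configured Hadoop client, so it is left commented out):
# hadoop fs -get report/part-r-00000 report.txt
# report_is_clean report.txt
```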

Other benchmarks

There are many more Hadoop benchmarks, but the following are widely used:

▪ TestDFSIO tests the I/O performance of HDFS. It does this by using a MapReduce job as a convenient way to read or write files in parallel.
