Setting Up a Hadoop Cluster - Hadoop: The Definitive Guide

▪ MRBench (invoked with mrbench) runs a small job a number of times. It acts as a good counterpoint to TeraSort, as it checks whether small job runs are responsive.

▪ NNBench (invoked with nnbench) is useful for load-testing namenode hardware.

▪ Gridmix is a suite of benchmarks designed to model a realistic cluster workload by mimicking a variety of data-access patterns seen in practice. See the documentation in the distribution for how to run Gridmix.

▪ SWIM, or the Statistical Workload Injector for MapReduce, is a repository of real-life MapReduce workloads that you can use to generate representative test workloads for your system.

▪ TPCx-HS is a standardized benchmark based on TeraSort from the Transaction Processing Performance Council.
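
As a rough illustration of how MRBench and NNBench are invoked by name, the following Python sketch builds the command lines for them. It assumes the benchmarks are launched from the MapReduce tests JAR with hadoop jar; the JAR path shown is a placeholder (the real path and file name depend on the Hadoop version and distribution), and benchmark_command is a hypothetical helper, not part of Hadoop.

```python
import shlex

# Placeholder location of the MapReduce tests JAR (an assumption; the actual
# path and file name vary by Hadoop version and distribution).
TESTS_JAR = "/path/to/hadoop-mapreduce-client-jobclient-tests.jar"


def benchmark_command(program, args=(), jar=TESTS_JAR):
    """Build the argument list for running one benchmark from the tests JAR."""
    return ["hadoop", "jar", jar, program, *args]


# MRBench: run the small job 50 times.
mrbench_cmd = benchmark_command("mrbench", ["-numRuns", "50"])

# NNBench: options are left out here; consult the tool's usage message.
nnbench_cmd = benchmark_command("nnbench")

# Print the MRBench command in a form that can be pasted into a shell.
print(shlex.join(mrbench_cmd))
```

To actually launch a benchmark, the resulting list could be passed to subprocess.run(cmd, check=True) on a machine with a configured Hadoop client.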

User Jobs

For tuning, it is best to include a few jobs that are representative of the jobs that your users run, so your cluster is tuned for these and not just for the standard benchmarks. If this is your first Hadoop cluster and you don't have any user jobs yet, then either Gridmix or SWIM is a good substitute.

When running your own jobs as benchmarks, you should select a dataset for your user jobs and use it each time you run the benchmarks to allow comparisons between runs. When you set up a new cluster or upgrade a cluster, you will be able to use the same dataset to compare the performance with previous runs.
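
One way to make such run-to-run comparisons concrete is to record the wall-clock time of each benchmark run on the fixed dataset and compare medians. The sketch below is only an illustration: the timings are hypothetical values, and percent_change is a helper invented for this example.

```python
from statistics import median


def percent_change(baseline_secs, current_secs):
    """Relative change in median runtime, in percent (negative means faster)."""
    base = median(baseline_secs)
    cur = median(current_secs)
    return (cur - base) / base * 100.0


# Hypothetical wall-clock times (seconds) for one user job, always run
# against the same dataset so that the numbers are comparable.
before_upgrade = [412.0, 405.0, 420.0]
after_upgrade = [371.0, 380.0, 366.0]

change = percent_change(before_upgrade, after_upgrade)
print(f"median runtime change: {change:+.1f}%")
```

Using the median rather than a single run reduces the influence of one unusually slow or fast run, although a handful of runs is still a small sample.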

[ 68 ] ECC memory is strongly recommended, as several Hadoop users have reported seeing many checksum errors when using non-ECC memory on Hadoop clusters.

[ 69 ] The mapred user doesn't use SSH because, in Hadoop 2 and later, the only MapReduce daemon is the job history server.

[ 70 ] See its man page for instructions on how to start ssh-agent.

[ 71 ] There can be more than one namenode when running HDFS HA.

[ 72 ] For more discussion on the security implications of SSH host keys, consult the article “SSH Host Key Protection” by Brian Hatch.

[ 73 ] Notice that there is no site file for MapReduce shown here. This is because the only MapReduce daemon is the job history server, and the defaults are sufficient.
