the NameNode, while each slave node runs both the TaskTracker and the DataNode.
Each slave node is configured with eight map slots and six reduce slots (about one
process per core). Each map/reduce process uses 400 MB of memory. The data block
size is set to 64 MB. We use the Hadoop fair scheduler* to control the total number
of map/reduce slots available to different testing jobs.
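As a concrete illustration, these settings could be expressed with the classic Hadoop 1.x configuration properties roughly as follows (a hedged sketch, not the chapter's actual configuration files; the property names assume the standard mapred-site.xml/hdfs-site.xml layout of Hadoop 1.x):

    <!-- mapred-site.xml: per-TaskTracker slot counts, per-task heap, fair scheduler -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>8</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>6</value>
    </property>
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx400m</value>
    </property>
    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.FairScheduler</value>
    </property>

    <!-- hdfs-site.xml: 64 MB HDFS block size -->
    <property>
      <name>dfs.block.size</name>
      <value>67108864</value>
    </property>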
17.6.1.2 Amazon EC2 Configuration
We also use on-demand clusters provisioned from Amazon EC2 in the experiments.
Only small instances (1 EC2 compute unit, 1.7 GB of memory, and a 160 GB hard drive)
are used to set up the on-demand clusters. For simplicity of configuration, one map
slot and one reduce slot share one instance. Therefore, a cluster that needs m map slots
and r reduce slots requires max{m, r} + 1 small instances in total, with the additional
instance serving as the master node. The existing script in the Hadoop package† is used
to automatically set up the required Hadoop cluster (with the proper node configurations) in EC2.
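To make the sizing rule concrete, a minimal sketch follows (the function name and the example numbers are ours, for illustration only):

    # Number of EC2 small instances needed for a cluster with m map slots and
    # r reduce slots: one map slot and one reduce slot share each worker
    # instance, and one extra instance hosts the master node.
    def ec2_instances_needed(m: int, r: int) -> int:
        return max(m, r) + 1

    # Example: 20 map slots and 10 reduce slots -> max(20, 10) + 1 = 21 instances.
    print(ec2_instances_needed(20, 10))  # 21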
17.6.1.3 Data Sets
We use a number of generators to produce three types of testing data sets for the
testing programs. (1) We revise the RandomWriter tool in the Hadoop package to
generate random floating-point numbers. This type of data is used by the Sort program.
(2) We also revise the RandomTextWriter tool to generate text data based on a list of
1000 words randomly sampled from the system dictionary /usr/share/dict/words. This
type of data is used by the WordCount program and the TableJoin program. (3) The
third data set is a synthetic random graph, generated for the PageRank program.
Each line of the data set starts with a node ID and its initial PageRank, followed
by a list of node IDs representing the node's outlinks. Both the node ID and the
outlinks are randomly generated integers.
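For concreteness, the third generator's line format could be produced along these lines (a minimal sketch; the output path, ID range, number of records, out-degree bound, and initial rank value are illustrative assumptions, not the chapter's settings):

    import random

    ID_RANGE = 1_000_000    # node IDs and outlink IDs are drawn from this range (assumed)
    NUM_LINES = 1_000_000   # number of graph records to emit (assumed)
    MAX_OUTLINKS = 20       # upper bound on a node's out-degree (assumed)
    INITIAL_RANK = 1.0      # initial PageRank written for every node (assumed)

    # Each line: <node ID> <initial PageRank> <outlink ID> <outlink ID> ...
    # Both the node ID and its outlinks are random integers, as described above.
    with open("pagerank_input.txt", "w") as out:
        for _ in range(NUM_LINES):
            node = random.randrange(ID_RANGE)
            degree = random.randint(1, MAX_OUTLINKS)
            outlinks = " ".join(str(random.randrange(ID_RANGE)) for _ in range(degree))
            out.write(f"{node} {INITIAL_RANK} {outlinks}\n")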
Each type of data consists of 150 files of 1 GB each. For a specific testing task with
a predefined input data size (the parameter M), we randomly choose the required
number of files from the pool to simulate the input data.
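A minimal sketch of this sampling step (the pool path and file naming scheme are hypothetical):

    import random

    # Pool of 150 files of 1 GB each; the path and naming scheme are hypothetical.
    pool = [f"/data/pool/part-{i:03d}" for i in range(150)]

    M = 20  # required input size in GB for one testing task
    input_files = random.sample(pool, M)  # M distinct 1 GB files ~= M GB of input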
17.6.1.4 Modeling Tool
As we mentioned, we need a regression modeling method that works under the con-
straints βi ≥ 0. In the experiments, we use the MATLAB® function lsqnonneg‡ to learn
the model, which fits this goal exactly.
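The chapter uses MATLAB's lsqnonneg for the nonnegative least-squares fit; an equivalent sketch in Python uses scipy.optimize.nnls (the design matrix X and the targets y below are placeholder data, not the chapter's measurements):

    import numpy as np
    from scipy.optimize import nnls

    # X: one row per profiled MapReduce run, one column per cost-model term;
    # y: the observed costs. Placeholder synthetic data for illustration only.
    rng = np.random.default_rng(0)
    X = rng.random((50, 4))
    y = X @ np.array([2.0, 0.5, 0.0, 1.5]) + 0.01 * rng.random(50)

    # Solve min ||X * beta - y||_2 subject to beta >= 0, mirroring lsqnonneg.
    beta, residual_norm = nnls(X, y)
    print(beta)  # every fitted coefficient is nonnegative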
17.6.2 Testing Programs
In this section, we describe the MapReduce programs used in testing and give the
complexity of each program's reduce function, that is, the g() function. If g() falls into
one of the two special cases, the simplified cost model in Equation 17.10 is used.
* http://hadoop.apache.org/docs/r1.1.1/fair_scheduler.html.
† wiki.apache.org/hadoop/AmazonEC2.
‡ http://www.mathworks.com/help/techdoc/ref/lsqnonneg.html.