the NameNode, while each slave node runs both the TaskTracker and the DataNode.
Each slave node is configured with eight map slots and six reduce slots (about one
process per core). Each map/reduce process uses 400 MB of memory. The data block
size is set to 64 MB. We use the Hadoop fair scheduler* to control the total number
of map/reduce slots available to different testing jobs.
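As a concrete illustration, these settings could be expressed with the classic Hadoop 1.x configuration properties roughly as follows (a hedged sketch, not the chapter's actual configuration files; the property names assume the standard mapred-site.xml/hdfs-site.xml layout of Hadoop 1.x):

    <!-- mapred-site.xml: per-TaskTracker slot counts, per-task heap, fair scheduler -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>8</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>6</value>
    </property>
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx400m</value>
    </property>
    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.FairScheduler</value>
    </property>

    <!-- hdfs-site.xml: 64 MB HDFS block size -->
    <property>
      <name>dfs.block.size</name>
      <value>67108864</value>
    </property>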
17.6.1.2 Amazon EC2 Configuration
We also use on-demand clusters provisioned from Amazon EC2 in the experiments.
Only small instances (1 EC2 compute unit, 1.7 GB of memory, and a 160 GB hard drive)
are used to set up the on-demand clusters. For simplicity of configuration, one map
slot and one reduce slot share one instance. Therefore, a cluster that needs m map slots
and r reduce slots requires max{m, r} + 1 small instances in total, with the additional
instance serving as the master node. The existing script in the Hadoop package† is used
to automatically set up the required Hadoop cluster (with the proper node configurations) in EC2.
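To make the sizing rule concrete, a minimal sketch follows (the function name and the example numbers are ours, for illustration only):

    # Number of EC2 small instances needed for a cluster with m map slots and
    # r reduce slots: one map slot and one reduce slot share each worker
    # instance, and one extra instance hosts the master node.
    def ec2_instances_needed(m: int, r: int) -> int:
        return max(m, r) + 1

    # Example: 20 map slots and 10 reduce slots -> max(20, 10) + 1 = 21 instances.
    print(ec2_instances_needed(20, 10))  # 21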
17.6.1.3 Data Sets
We use a number of generators to produce three types of testing data sets for the
testing programs. (1) We revise the RandomWriter tool in the Hadoop package to
generate random floating-point numbers. This type of data is used by the Sort program.
(2) We also revise the RandomTextWriter tool to generate text data based on a list of
1000 words randomly sampled from the system dictionary /usr/share/dict/words. This
type of data is used by the WordCount program and the TableJoin program. (3) The
third data set is a synthetic random graph, generated for the PageRank program.
Each line of the data set starts with a node ID and its initial PageRank, followed
by a list of node IDs representing the node's outlinks. Both the node ID and the
outlinks are randomly generated integers.
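For concreteness, the third generator's line format could be produced along these lines (a minimal sketch; the output path, ID range, number of records, out-degree bound, and initial rank value are illustrative assumptions, not the chapter's settings):

    import random

    ID_RANGE = 1_000_000    # node IDs and outlink IDs are drawn from this range (assumed)
    NUM_LINES = 1_000_000   # number of graph records to emit (assumed)
    MAX_OUTLINKS = 20       # upper bound on a node's out-degree (assumed)
    INITIAL_RANK = 1.0      # initial PageRank written for every node (assumed)

    # Each line: <node ID> <initial PageRank> <outlink ID> <outlink ID> ...
    # Both the node ID and its outlinks are random integers, as described above.
    with open("pagerank_input.txt", "w") as out:
        for _ in range(NUM_LINES):
            node = random.randrange(ID_RANGE)
            degree = random.randint(1, MAX_OUTLINKS)
            outlinks = " ".join(str(random.randrange(ID_RANGE)) for _ in range(degree))
            out.write(f"{node} {INITIAL_RANK} {outlinks}\n")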
Each type of data consists of 150 files of 1 GB each. For a specific testing task with
a predefined input data size (the parameter M), we randomly choose the required
number of files from the pool to simulate the input data.
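A minimal sketch of this sampling step (the pool path and file naming scheme are hypothetical):

    import random

    # Pool of 150 files of 1 GB each; the path and naming scheme are hypothetical.
    pool = [f"/data/pool/part-{i:03d}" for i in range(150)]

    M = 20  # required input size in GB for one testing task
    input_files = random.sample(pool, M)  # M distinct 1 GB files ~= M GB of input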
17.6.1.4 Modeling Tool
As we mentioned, we need a regression modeling method that works under the con-
straints βi ≥ 0. In the experiments, we use the MATLAB® function lsqnonneg‡ to learn
the model, which fits this goal exactly.
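The chapter uses MATLAB's lsqnonneg for the nonnegative least-squares fit; an equivalent sketch in Python uses scipy.optimize.nnls (the design matrix X and the targets y below are placeholder data, not the chapter's measurements):

    import numpy as np
    from scipy.optimize import nnls

    # X: one row per profiled MapReduce run, one column per cost-model term;
    # y: the observed costs. Placeholder synthetic data for illustration only.
    rng = np.random.default_rng(0)
    X = rng.random((50, 4))
    y = X @ np.array([2.0, 0.5, 0.0, 1.5]) + 0.01 * rng.random(50)

    # Solve min ||X * beta - y||_2 subject to beta >= 0, mirroring lsqnonneg.
    beta, residual_norm = nnls(X, y)
    print(beta)  # every fitted coefficient is nonnegative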
17.6.2 Testing Programs
In this section, we describe the MapReduce programs used in testing and give the
complexity of each program's reduce function, that is, the g() function. If g() falls into
one of the two special cases, the simplified cost model in Equation 17.10 is used.
* http://hadoop.apache.org/docs/r1.1.1/fair_scheduler.html.
† wiki.apache.org/hadoop/AmazonEC2.
‡ http://www.mathworks.com/help/techdoc/ref/lsqnonneg.html.