Toward Optimal Resource Provisioning for Economical and Green MapReduce Computing in the Cloud - Large Scale and Big Data: Processing and Management


the large text data set. The first word of each line in both types of file serves as the join key. The map program emits the lines of the large and small input files. Each line of the small file is labeled so that it can be distinguished in the map output. In the reduce phase, the lines are checked to find those with matching keys. If lines from both files match, a Cartesian product is applied between the two sets of lines with the same key to generate the output. Depending on the key distribution, the size of the output data may vary. In the reduce program, assume there are λ lines from the large file and μ lines from the small file. The result of the Cartesian product is λμ lines. Since μ ≤ 50 is very small, the complexity function g() is approximately linear in the input size of λ + μ lines.
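The join logic described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the program used in the experiments: the function names (`map_line`, `reduce_key`, `table_join`) and the boolean flag used as the small-file label are assumptions made for the example, and the shuffle step is simulated with a dictionary.

```python
from collections import defaultdict
from itertools import product


def map_line(line, from_small_file):
    """Emit (join key, (label, line)); the join key is the first word of the line."""
    words = line.split()
    if not words:
        return None  # blank lines carry no key and are skipped
    # The boolean label is a hypothetical stand-in for the small-file marker.
    return words[0], (from_small_file, line)


def reduce_key(key, labeled_lines):
    """Pair every large-file line with every small-file line that shares the key."""
    large = [line for is_small, line in labeled_lines if not is_small]
    small = [line for is_small, line in labeled_lines if is_small]
    # Cartesian product: len(large) * len(small) rows, empty if either side is empty.
    return [(key, big, sm) for big, sm in product(large, small)]


def table_join(large_lines, small_lines):
    """Group the map output by key (a simulated shuffle), then reduce each group."""
    groups = defaultdict(list)
    for lines, is_small in ((large_lines, False), (small_lines, True)):
        for line in lines:
            emitted = map_line(line, is_small)
            if emitted is not None:
                key, value = emitted
                groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reduce_key(key, groups[key]))
    return output
```

For a key with λ large-file lines and μ small-file lines, `reduce_key` returns λμ rows, which stays roughly proportional to λ when μ is bounded by a small constant.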

17.6.3 Model Analysis

We run a set of experiments to estimate the model parameters β_i for the four programs. We randomly select the values for the three parameters M, m, and R. The number of data chunks M is calculated from the number of selected 1 GB files (one file has 1024/64 = 16 blocks). For the in-house cluster, because all available map slots will be used in executing the MapReduce job, we control the number of map slots m by setting the maximum number of map slots in the fair scheduler. R is randomly set to a number smaller than the total number of reduce slots in the system. For on-demand EC2 clusters, it is straightforward to allocate m nodes as the map slots and R nodes for the reduce slots.

For each tested program, we generate tens of random settings of (M, m, R). M is randomly selected from the integers [1…150] × 16, that is, the number of 1 GB files × 16 blocks/file. R is randomly selected from the integers [1…50]. Since changing m requires updating the scheduler setting, we limit the choices of m to 30, 60, 90, and 120. For each setting, we record the time (in seconds) used to finish the program. The examples are ordered by the time cost for further analysis.
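The sampling scheme for (M, m, R) can be illustrated with a short Python sketch. The function names, the `count` and `seed` parameters, and the use of Python's `random` module are assumptions for illustration; the source does not say how the random settings were generated in practice.

```python
import random

BLOCKS_PER_FILE = 1024 // 64          # a 1 GB file in 64 MB blocks gives 16 blocks
MAP_SLOT_CHOICES = (30, 60, 90, 120)  # the allowed values of m


def sample_setting(rng):
    """Draw one (M, m, R) setting following the ranges given in the text."""
    files = rng.randint(1, 150)       # number of 1 GB input files, inclusive range
    M = files * BLOCKS_PER_FILE       # number of data chunks
    m = rng.choice(MAP_SLOT_CHOICES)  # number of map slots
    R = rng.randint(1, 50)            # number of reduce slots, inclusive range
    return M, m, R


def sample_settings(count, seed=0):
    """Draw `count` settings; `count` and `seed` are hypothetical parameters."""
    rng = random.Random(seed)
    return [sample_setting(rng) for _ in range(count)]
```

Each sampled setting would then be run once and paired with its measured running time to form one training example for the regression.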

17.6.3.1 Regression Analysis

With the transformed variables (Equation 17.11), we can conduct a linear regression on the transformed cost model

$$T(x_1, x_2, x_3, x_4, x_5, x_6, x_7) = \beta_0 + \sum_{i=1}^{7} \beta_i x_i. \tag{17.18}$$

Table 17.1 shows the result of regression analysis with the constraints β_i ≥ 0 for programs running in the in-house cluster. R² is a measure of goodness of fit in regression modeling. R² = 1 means a perfect fit, while R² > 90% indicates a very good fit. Note that the MATLAB function lsqnonneg also demotes the insignificant β_i and sets them to 0.
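To make the constrained fit and the R² measure concrete, here is a minimal pure-Python sketch. It is not the procedure used for Table 17.1: a simple cyclic coordinate descent stands in for lsqnonneg, the function names and the `sweeps` parameter are hypothetical, and the intercept β_0 is assumed to be handled by adding a constant column of ones to the design matrix.

```python
def nnls_coordinate_descent(X, y, sweeps=200):
    """Minimize ||y - X b||^2 subject to b >= 0 by cyclic coordinate descent.

    X is a list of rows; include a column of ones to model the intercept.
    """
    n, p = len(X), len(X[0])
    b = [0.0] * p
    residual = list(y)  # equals y - X b while b is all zeros
    col_sq = [sum(X[r][j] ** 2 for r in range(n)) for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column cannot change the fit
            grad = sum(X[r][j] * residual[r] for r in range(n))
            # Exact one-dimensional minimizer, clipped at the constraint b_j >= 0.
            new_bj = max(0.0, b[j] + grad / col_sq[j])
            delta = new_bj - b[j]
            if delta != 0.0:
                for r in range(n):
                    residual[r] -= delta * X[r][j]
                b[j] = new_bj
    return b


def r_squared(X, y, b):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    ss_res = sum(
        (v - sum(c * x for c, x in zip(b, row))) ** 2 for row, v in zip(X, y)
    )
    return 1.0 - ss_res / ss_tot
```

A coefficient whose unconstrained estimate would be negative is held at 0 by the clipping step, which mirrors the way the non-negativity constraint removes such terms from the fitted model.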

Table 17.1 shows that most models have very high R² values, except for TableJoin on AWS. The lower model quality might be caused by either the dynamic run-time environment or special characteristics of the program (or data) that the model does not capture. However, the TableJoin model in the local cluster shows
