Database Reference
In-Depth Information
the large text data set. The first word of each line in both types of file serves as the
join key. The map program emits the lines of the input large and small files. Each
line of the small file is labeled so that they can be distinguished from the map output.
In the reduce, the lines are checked to find those with matched keys. If the lines from
both files are found to be matched, a Cartesian product is applied between the two
sets of lines with the same key to generate the output. Depending on the key distribu-
tion, the size of output data may vary. In the reduce program, assume there is λ lines
from the large file and μ lines from the small file. The result of Cartesian product
is λμ lines. Since μ ≤ 50 is very small, the complexity function g () is approximately
linear to the input λ + μ lines.
17.6.3 m oDel a nalysis
We run a set of experiments to estimate the model parameters β i for the four pro-
grams. We randomly select the values for the three parameters M , m , and R . The
number of data chunks M is calculated by the number of selected 1 GB files (one file
has 1024/64 = 16 blocks). For the in-house cluster, because all available map slots
will be used in executing the MapReduce job, we control the number of map slots
m by setting the maximum number of map slots in the fair scheduler . R is randomly
set to a number smaller than the total number of reduce Slots in the system. For
on-demand EC2 clusters, it is straightforward to allocate m nodes as the map slots
and R nodes for the reduce slots.
For each tested program, we generate tens of random settings of ( M , m , R ). M is
randomly selected from the integers [1…150] × 16, that is, the number of 1 GB files ×
16 blocks/file. R is randomly selected from the integers [1…50]. Since changing m
will need to update the scheduler setting, we limit the choices of m to 30, 60, 90, and
120. For each setting, we record the time (seconds) used to finish the program. The
examples are ordered by the time cost for further analysis.
17.6.3.1 Regression Analysis
With the transformed variables (Equation 17.11), we can conduct a linear regression
on the transformed cost model
7
=+
=
Tx xxxxxx
(, ,,,,,
)
β
β .
x
(17.18)
1
234567
0
11
i
1
Table 17.1 shows the result of regression analysis with the constraints β i ≥ 0 for pro-
grams running in the in-house cluster. R 2 is a measure for evaluating the goodness of
fit in regression modeling. R 2 = 1 means a perfect fit, while R 2 > 90% indicates a very
good fit. Note that the MATLAB function lsqnonneg also demotes the insignificant
β i and sets them to 0.
Table 17.1 shows most models have very high R 2 values, except for TableJoin on
AWS. The reason of lower-quality models might be caused by either the dynamic
run-time environment or the special characteristics of the program (or data) that the
model does not capture. However, the TableJoin model in the local cluster shows
Search WWH ::




Custom Search