Database Reference
In-Depth Information
Here, we simplify the analysis by assuming the amount of data is proportional to the
number of keys assigned to the reduce. In practice, many applications have skewed
data distributions, that is, some keys may have more records while other may have less,
which may affect the quality of modeling.
The copy cost is linear to b R , denoted as c ( b R ). However, most of the time is
overlapped with the map phase. Normally only the last few rounds of map process-
ing may contribute to the overall time cost. We thus approximate the cost as c ( b R ) ~
m   o m ( b ) k / R .
A merge process follows to merge the M shares from the map results. Because the
records are already sorted by the key, this process simply merges the shares by the
key in multiple rounds. Assume the buffer size is B , the merge round i will generate
M / B i files, and its cost is proportional to b R . The total number of rounds is ⌈log B M
rceil ⌉. Thus, the total merge cost ms ( b R ) is proportional to b R ⌈log B M ⌉.
The reduce program will process the data with some complexity g ( b R ) that
depends on the specific application. Assume the output data of the reduce program
has an amount o r ( b R ), which is often less than b R . Finally, the result is duplicated
and written back to multiple nodes, with the complexity linear to o r ( b R ), denoted as
wr ( o r ( b R )).
In summary, the cost of the reduce process is the sum of the component costs,
Φ r = c ( b R ) + ms ( b R ) + g ( b R ) + wr ( o r ( b R )) + ϵ r ,
(17. 5)
Both the Copy and the WriteResult costs may vary because of the varying network
I/O performance, which are modeled with the random variable ϵ r . Similar to ϵ m for
the map phase, ϵ r has a mean zero and some variance σ 2 . These variances should be
captured in modeling.
If we model Φ r with m , M , r , R , and keep the relevant components for each phase,
we have
Φ r ( m , M , r , R ) = λ 1 ( m / R ) + λ 2 ( M log M / R ) + g ( M / R ) + λ 3 ( M / R ) + ϵ r ,
(17.6)
where λ 1 and the distribution of ϵ r are application-specific constants.
17.3.4 P utting i t a ll t ogether
According to the parallel execution model we described in Figure 17.2, the overall
time complexity T depends on the number of map rounds and reduce rounds. The
cost of managing and scheduling the map and reduce processes Θ( M , R ) = ξ 1 M + ξ 2 R
is linear to M and R , as stated in the documentation [22]. By assuming all the pro-
cesses in each map (or reduce) round finish around the same time, we can represent
the overall cost as
=
M
m
+
R
r
T
Φ
Φ Θ(,) .
+
MR
(17.7)
m
r
Search WWH ::




Custom Search