17.3.2 Cost of Map Process
A map process can be divided into a number of sequential components, including read, map, sort/partition, and optionally combine, as Figure 17.1 shows. We understand this process in terms of a data flow: data flows sequentially through each component, and the cost of each component depends on the amount of input data.
The first component is reading a block of data from the disk, which can be either a local or a remote data block. Let us assume the average cost is a function of the size of the data block b: i(b).
The second component is the user-defined map program, whose time complexity is determined by the input data size b, denoted as f(b). The map program outputs data of size o_m(b), which may vary depending on the specific data. The output will be a list of ⟨key, value⟩ pairs.
The result will be partitioned and sorted by key into R shares for the R reduce processes. We denote the cost of partitioning and sorting with s(o_m(b), R). If the partitioning process uses a hash function to map the keys, the partitioning cost is independent of R. However, the sorting phase is still affected by R. Let us skip the combiner component for now and revisit it later.
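The partition/sort step can be illustrated with a short sketch: ⟨key, value⟩ pairs are hashed into R shares, and each share is then sorted by key. The function and variable names here are illustrative, not from the chapter; note how the hashing pass touches each record once regardless of R, while the sorting work is done per share.

```python
def partition_and_sort(pairs, R):
    """Split <key, value> pairs into R shares by hash, sort each share by key."""
    shares = [[] for _ in range(R)]
    for key, value in pairs:
        # One hash per record: this pass's cost is independent of R.
        shares[hash(key) % R].append((key, value))
    for share in shares:
        # Per-share sort: the sorting phase is what R still affects.
        share.sort(key=lambda kv: kv[0])
    return shares

shares = partition_and_sort([("b", 2), ("a", 1), ("a", 3), ("c", 4)], R=2)
```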
In summary, the overall cost of a map process is the sum of the component costs (without the combiner component):

Φ_m = i(b) + f(b) + s(o_m(b), R) + ϵ_m.    (17.2)
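Equation 17.2 can be exercised with hypothetical component cost functions. The concrete forms below (a linear read cost, an O(b log b) map cost, and a sort/partition cost with a small linear term in R) are assumptions for illustration only; the chapter leaves i, f, o_m, and s abstract.

```python
import math
import random

def i(b):
    # Read cost: assumed proportional to the block size b.
    return 0.01 * b

def f(b):
    # Map program cost: an example O(b log b) complexity.
    return 0.002 * b * math.log2(b)

def o_m(b):
    # Map output size: here assumed to be half the input size.
    return 0.5 * b

def s(out, R):
    # Sort/partition cost: grows with output size, slightly linear in R.
    return 0.005 * out * math.log2(out) + 0.0001 * out * R

def map_cost(b, R, eps_sigma=0.0):
    # Equation 17.2: sum of component costs plus a zero-mean noise term.
    eps = random.gauss(0.0, eps_sigma)
    return i(b) + f(b) + s(o_m(b), R) + eps

cost = map_cost(b=64 * 2**20, R=8)   # a 64 MB block, 8 reduce processes
```

With eps_sigma left at zero the model is deterministic, which makes it easy to see how the cost responds to a larger block or more reduces.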
i(b) and f(b) depend only on the size of the data block b and the complexity of the map program, independent of the parameters m and M. ϵ_m has mean zero and some variance σ², which needs to be calibrated by experiments. We also observed that s(o_m(b), R) is approximately linear in R. In practice, we can model the cost with the parameters m, M, r, R as
Φ_m(m, M, r, R) = μ_1 + μ_2 R + ϵ_m,    (17.3)

where μ_1, μ_2, and the distribution of ϵ_m are constants specific to each application.
17.3.3 Cost of Reduce Process
The reduce process has the following components: Copy, MergeSort, Reduce, and WriteResult. These components are also executed sequentially in the reduce process.
Assume that the k keys of the map result are equally distributed to the R reduce processes.* In the copy component, each reduce process pulls its share, that is, k/R keys and the corresponding records, from the M map processes' outputs. Thus, the total amount of data in each reduce will be

b_R = M o_m(b)/R.    (17.4)
* For this reason, the user normally selects R to satisfy k ≥ R. If R > k, only k reduces are actually used.