17.3.2 Cost of Map Process
A map process can be divided into a number of sequential components: read, map, sort/partition, and optionally combine, as Figure 17.1 shows. We understand this process in terms of a data flow: data flow sequentially through each component, and the cost of each component depends on the amount of its input data.
The first component reads a block of data from disk; the block can be either local or remote. Let us assume the average cost is a function of the block size b: i(b).
The second component is the user-defined map program, whose time complexity is determined by the input data size b and is denoted f(b). The map program outputs data of size o_m(b), which may vary with the specific data. The output is a list of ⟨key, value⟩ pairs.
The result is partitioned and sorted by key into R shares for the R reduce processes. We denote the cost of partitioning and sorting by s(o_m(b), R). If the partitioning process uses a hash function to map the keys, the partitioning cost is independent of R; the sorting phase, however, is still affected by R. We skip the combiner component for now and revisit it later.
In summary, the overall cost of a map process is the sum of the component costs (without the combiner):

Φ_m = i(b) + f(b) + s(o_m(b), R) + ε_m.
(17.2)
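As a concrete illustration of Equation 17.2, the per-map cost can be sketched as a sum of component costs. The specific cost functions below (linear read and map costs, an n log n sort term, a small linear dependence on R) are hypothetical stand-ins chosen only for the example; in practice each would be measured or calibrated.

```python
import math
import random

def map_cost(b, R, i, f, o_m, s, eps_sigma=0.0):
    """Total cost of one map process (Equation 17.2):
    read + map + sort/partition + zero-mean noise (combiner omitted)."""
    noise = random.gauss(0.0, eps_sigma)  # epsilon_m: zero-mean noise term
    return i(b) + f(b) + s(o_m(b), R) + noise

# Hypothetical component costs for a 64 MB block (illustrative, not from the text):
i = lambda b: 0.01 * b       # read cost, linear in block size
f = lambda b: 0.05 * b       # user-defined map program, linear here
o_m = lambda b: 0.5 * b      # map output is half the input size
s = lambda out, R: out * math.log2(max(out, 2)) * 1e-3 + 0.002 * R  # sort/partition

cost = map_cost(64.0, R=8, i=i, f=f, o_m=o_m, s=s)
```

With eps_sigma left at zero the model is deterministic, which is convenient when checking a calibration against measured runs.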
i(b) and f(b) depend only on the size of the data block b and the complexity of the map program; they are independent of the parameters m and M. ε_m has mean zero and some variance σ², which needs to be calibrated by experiments. We also observed that s(o_m(b), R) is approximately linear in R. In practice, we can model the cost with the parameters m, M, r, R as
Φ_m(m, M, r, R) = μ_1 + μ_2 R + ε_m,
(17.3)
where μ_1, μ_2, and the distribution of ε_m are constants specific to each application.
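Since the map cost is approximately linear in R, the constants μ_1 and μ_2 can be calibrated from measured runs by an ordinary least-squares fit over (R, cost) samples. The sketch below uses synthetic, noise-free samples purely to illustrate the fitting step; real calibration data would include the ε_m noise.

```python
def fit_linear(samples):
    """Least-squares fit of cost ~ mu1 + mu2 * R from (R, cost) pairs."""
    n = len(samples)
    sx = sum(r for r, _ in samples)
    sy = sum(c for _, c in samples)
    sxx = sum(r * r for r, _ in samples)
    sxy = sum(r * c for r, c in samples)
    mu2 = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope: cost per extra reduce
    mu1 = (sy - mu2 * sx) / n                        # intercept: R-independent cost
    return mu1, mu2

# Synthetic calibration runs following cost = 3.0 + 0.25 * R (illustrative values)
samples = [(R, 3.0 + 0.25 * R) for R in (1, 2, 4, 8, 16)]
mu1, mu2 = fit_linear(samples)
```

On noisy measurements the same fit returns the least-squares estimates of μ_1 and μ_2, and the residuals give an empirical estimate of the variance σ² of ε_m.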
17.3.3 Cost of Reduce Process
The reduce process consists of four components: Copy, MergeSort, Reduce, and WriteResult. These components are also executed sequentially within the reduce process.
Assume that the k keys of the map result are equally distributed to the R reduce processes.* In the copy component, each reduce process pulls its share, that is, k/R keys and the corresponding records, from the outputs of the M map processes. Thus, the total amount of data arriving at each reduce will be
b_R = M o_m(b)/R.
(17.4)
* For this reason, the user normally selects R to satisfy k ≥ R. If R > k, only k reduces are actually used.
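Under the equal-distribution assumption, each reduce receives k/R of the k keys, i.e., a 1/R fraction of each map's output, so the copy volume per reduce follows directly. The values of M, o_m(b), and R below are hypothetical, chosen only to show the computation.

```python
def reduce_input_size(M, o_m_b, R):
    """Data volume pulled by one reduce process (Equation 17.4):
    each of the M map outputs of size o_m(b) is split evenly across
    the R reduces, so one reduce receives M * o_m(b) / R."""
    return M * o_m_b / R

# Hypothetical cluster: 100 maps, 32 MB of map output each, 8 reduces
b_R = reduce_input_size(M=100, o_m_b=32.0, R=8)
```

Note that b_R grows linearly with the number of maps M but shrinks as more reduces R are added, which is the trade-off the cost model captures.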