17.3.2 Cost of Map Process
A map process can be divided into a number of sequential components, including read, map, sort/partition, and optionally combine, as Figure 17.1 shows. We understand this process in terms of a data flow: data flows sequentially through each component, and the cost of each component depends on the amount of input data.
The first component is reading a block of data from the disk, which can be either a local or a remote data block. Let us assume the average cost is a function of the size of the data block b: i(b).
The second component is the user-defined map program, whose time complexity is determined by the input data size b, denoted as f(b). The map program outputs data of size o_m(b), which may vary depending on the specific data. The output will be a list of ⟨key, value⟩ pairs.
The result will be partitioned and sorted by key into R shares for the R reduce processes. We denote the cost of partitioning and sorting with s(o_m(b), R). If the partitioning process uses a hash function to map the keys, the partitioning cost is independent of R. However, the sorting phase is still affected by R. Let us skip the combiner component for now and revisit it later.
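The partition/sort step can be illustrated with a short sketch: ⟨key, value⟩ pairs are hashed into R shares, and each share is then sorted by key. The function and variable names here are illustrative, not from the chapter; note how the hashing pass touches each record once regardless of R, while the sorting work is done per share.

```python
def partition_and_sort(pairs, R):
    """Split <key, value> pairs into R shares by hash, sort each share by key."""
    shares = [[] for _ in range(R)]
    for key, value in pairs:
        # One hash per record: this pass's cost is independent of R.
        shares[hash(key) % R].append((key, value))
    for share in shares:
        # Per-share sort: the sorting phase is what R still affects.
        share.sort(key=lambda kv: kv[0])
    return shares

shares = partition_and_sort([("b", 2), ("a", 1), ("a", 3), ("c", 4)], R=2)
```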
In summary, the overall cost of a map process is the sum of the component costs (without the combiner component):

Φ_m = i(b) + f(b) + s(o_m(b), R) + ϵ_m.    (17.2)
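Equation 17.2 can be exercised with hypothetical component cost functions. The concrete forms below (a linear read cost, an O(b log b) map cost, and a sort/partition cost with a small linear term in R) are assumptions for illustration only; the chapter leaves i, f, o_m, and s abstract.

```python
import math
import random

def i(b):
    # Read cost: assumed proportional to the block size b.
    return 0.01 * b

def f(b):
    # Map program cost: an example O(b log b) complexity.
    return 0.002 * b * math.log2(b)

def o_m(b):
    # Map output size: here assumed to be half the input size.
    return 0.5 * b

def s(out, R):
    # Sort/partition cost: grows with output size, slightly linear in R.
    return 0.005 * out * math.log2(out) + 0.0001 * out * R

def map_cost(b, R, eps_sigma=0.0):
    # Equation 17.2: sum of component costs plus a zero-mean noise term.
    eps = random.gauss(0.0, eps_sigma)
    return i(b) + f(b) + s(o_m(b), R) + eps

cost = map_cost(b=64 * 2**20, R=8)   # a 64 MB block, 8 reduce processes
```

With eps_sigma left at zero the model is deterministic, which makes it easy to see how the cost responds to a larger block or more reduces.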
i(b) and f(b) depend only on the size of the data block b and the complexity of the map program, independent of the parameters m and M. ϵ_m has mean zero and some variance σ², which needs to be calibrated by experiments. We also observed that s(o_m(b), R) is approximately linear in R. In practice, we can model the cost with the parameters m, M, r, R as
Φ_m(m, M, r, R) = μ_1 + μ_2 R + ϵ_m,    (17.3)

where μ_1, μ_2, and the distribution of ϵ_m are constants specific to each application.
17.3.3 Cost of Reduce Process
The reduce process has the following components: Copy, MergeSort, Reduce, and WriteResult. These components are also executed sequentially in the reduce process.
Assume that the k keys of the map result are equally distributed to the R reduce processes.* In the copy component, each reduce process pulls its share, that is, k/R keys and the corresponding records, from the M map processes' outputs. Thus, the total amount of data in each reduce will be

b_R = M o_m(b)/R.    (17.4)
* For this reason, the user normally selects R to satisfy k ≥ R. If R > k, only k reduces are actually used.