Toward Optimal Resource Provisioning for Economical and Green MapReduce Computing in the Cloud - Large Scale and Big Data: Processing and Management

Database Reference

In-Depth Information

Here, we simplify the analysis by assuming the amount of data is proportional to the

number of keys assigned to the reduce. In practice, many applications have skewed

data distributions, that is, some keys may have more records while other may have less,

which may affect the quality of modeling.

The copy cost is linear to b R , denoted as c ( b R ). However, most of the time is

overlapped with the map phase. Normally only the last few rounds of map process-

ing may contribute to the overall time cost. We thus approximate the cost as c ( b R ) ~

m o m ( b ) k / R .

A merge process follows to merge the M shares from the map results. Because the

records are already sorted by the key, this process simply merges the shares by the

key in multiple rounds. Assume the buffer size is B , the merge round i will generate

M / B i files, and its cost is proportional to b R . The total number of rounds is ⌈log B M

rceil ⌉. Thus, the total merge cost ms ( b R ) is proportional to b R ⌈log B M ⌉.

The reduce program will process the data with some complexity g ( b R ) that

depends on the specific application. Assume the output data of the reduce program

has an amount o r ( b R ), which is often less than b R . Finally, the result is duplicated

and written back to multiple nodes, with the complexity linear to o r ( b R ), denoted as

wr ( o r ( b R )).

In summary, the cost of the reduce process is the sum of the component costs,

Φ r = c ( b R ) + ms ( b R ) + g ( b R ) + wr ( o r ( b R )) + ϵ r ,

(17. 5)

Both the Copy and the WriteResult costs may vary because of the varying network

I/O performance, which are modeled with the random variable ϵ r . Similar to ϵ m for

the map phase, ϵ r has a mean zero and some variance σ 2 . These variances should be

captured in modeling.

If we model Φ r with m , M , r , R , and keep the relevant components for each phase,

we have

Φ r ( m , M , r , R ) = λ 1 ( m / R ) + λ 2 ( M log M / R ) + g ( M / R ) + λ 3 ( M / R ) + ϵ r ,

(17.6)

where λ 1 and the distribution of ϵ r are application-specific constants.

17.3.4 P utting i t a ll t ogether

According to the parallel execution model we described in Figure 17.2, the overall

time complexity T depends on the number of map rounds and reduce rounds. The

cost of managing and scheduling the map and reduce processes Θ( M , R ) = ξ 1 M + ξ 2 R

is linear to M and R , as stated in the documentation [22]. By assuming all the pro-

cesses in each map (or reduce) round finish around the same time, we can represent

the overall cost as

Φ Θ(,) .

(17.7)

Search WWH ::

Custom Search

Home