We take a combined approach instead. This method depends on a best-effort
analysis of the whole MapReduce processing framework, which results in the
following cost function:
T(S, D, C) = Σ_{i=1}^{k} β_i h_i(S, D, C) + β_0,    (17.1)
where the h_i(S, D, C) are possibly nonlinear transformations of the input factors S,
D, and C, capturing the time complexity of the sequential processing components in the
system, and the β_i are the component weights, which differ from application to
application. The h_i(S, D, C) are obtained through the analysis of the MapReduce
processing components, while the β_i are learned for a specific application based on
sample runs of that application. With this modeling idea in mind, in the following
subsections we conduct the modeling analysis, give a concrete formulation of the cost
functions of the map task and the reduce task to find the h_i(S, D, C), and finally
integrate these components into the whole cost function.
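Since the β_i are learned from sample runs while the h_i(S, D, C) are fixed by the component analysis, fitting Equation 17.1 reduces to ordinary linear regression. The following is a minimal sketch of that fitting step in Python, assuming hypothetical h_i values and measured costs from five sample runs; all numbers and variable names are illustrative, not from the text.

```python
import numpy as np

# Hypothetical design matrix: H[j][i] = h_i(S, D, C) evaluated on the j-th
# sample run of the application (k = 3 components, n = 5 runs).
H = np.array([
    [1.0,  2.0,  0.5],
    [2.0,  3.5,  1.0],
    [4.0,  5.0,  2.0],
    [8.0,  9.5,  4.0],
    [16.0, 18.0, 8.0],
])
# Measured total costs T(S, D, C) of the same runs (e.g., seconds).
T = np.array([3.1, 5.2, 8.9, 16.8, 33.0])

# Prepend a column of ones so the intercept beta_0 is fitted jointly
# with the component weights beta_1..beta_k.
X = np.hstack([np.ones((H.shape[0], 1)), H])
beta, *_ = np.linalg.lstsq(X, T, rcond=None)  # ordinary least squares
print("beta_0:", beta[0], "beta_1..k:", beta[1:])

# Predict the cost of an unseen run from its h_i values.
h_new = np.array([1.0, 6.0, 7.0, 3.0])  # [1, h_1, h_2, h_3]
print("predicted cost:", h_new @ beta)
```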
17.3.1 Analyzing the Process of MapReduce
MapReduce processing is a mix of sequential and parallel processing. The map
phase is executed before the reduce phase,* as Figure 17.1 shows. However, in each
phase many map or reduce processes are executed in parallel. To describe the
MapReduce execution clearly, we distinguish the concepts of Map/Reduce slot and
Map/Reduce process. Each map (or reduce) process is executed in a map (or
reduce) slot, where a slot is a unit of computing resources allocated for the
corresponding process. Based on its capacity, a computing node can accommodate
only a fixed number of slots, so that the parallel processes can run in their
slots without serious resource contention. In Hadoop, the TaskTracker running on
each slave node sets the number of map slots and the number of reduce slots. A
common setting for a multicore computer is two map slots and two reduce slots per
core. Without loss of generality, let us assume there are m map slots and r
reduce slots in total over all slave nodes.
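To make the slot accounting concrete, the small sketch below computes the totals m and r for a hypothetical homogeneous cluster under the per-core rule of thumb above. The Hadoop 1.x property names in the comments are the actual configuration knobs, but every number is an assumption.

```python
# Per-node slot counts; in Hadoop 1.x these are set per TaskTracker via
# mapred.tasktracker.map.tasks.maximum and
# mapred.tasktracker.reduce.tasks.maximum.
num_slave_nodes = 10   # assumed cluster size
cores_per_node = 4     # assumed hardware
slots_per_core = 2     # "two map slots and two reduce slots per core"

map_slots_per_node = cores_per_node * slots_per_core
reduce_slots_per_node = cores_per_node * slots_per_core

m = num_slave_nodes * map_slots_per_node      # total map slots
r = num_slave_nodes * reduce_slots_per_node   # total reduce slots
print(m, r)  # 80 80
```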
We define a map/reduce process as a map/reduce task running on a specific slot.
By default, in Hadoop each map process handles one chunk of data (e.g., 64 MB).
Therefore, if there are M chunks of data, M map processes in total will be scheduled
and assigned to the m slots. In the ideal case, m map processes can run in parallel in
the m slots; we call this one round of map processes. If M > m, which is normal for
large data sets, ⌈M/m⌉ rounds of map processes are needed.
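As a quick illustration of the round computation, here is a sketch assuming the default 64 MB chunk size and the hypothetical 80-slot cluster above:

```python
import math

data_size_mb = 100 * 1024  # assumed 100 GB input
chunk_mb = 64              # default chunk handled by one map process
m = 80                     # total map slots (hypothetical, from above)

M = math.ceil(data_size_mb / chunk_mb)  # chunks = map processes to schedule
map_rounds = math.ceil(M / m)           # rounds needed when M > m
print(M, map_rounds)                    # 1600 chunks -> 20 map rounds
```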
Different from the total number of map processes, the number of reduce pro-
cesses, denoted as R, can be set by the user or determined by specific application
requirements. The map outputs, that is, the key-value pairs, are organized by their
keys and then distributed evenly by key to the R reduce processes. Similarly, if
R > r, more than one round of reduce processes is scheduled. It is probably not very
meaningful to set R greater than the number of output keys of the map phase.
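The even distribution by key can be pictured as hash partitioning, in the spirit of Hadoop's default HashPartitioner (the key's hash modulo R); the sketch below, with assumed values for R and r, also shows the resulting number of reduce rounds. Python's built-in hash merely stands in for Hadoop's key hashing.

```python
import math

R = 12  # reduce processes, set by the user (hypothetical)
r = 8   # reduce slots available in the cluster (hypothetical)

def partition(key: str, num_reduces: int) -> int:
    # Assign a key to one of the R reduce processes by hashing, as
    # Hadoop's HashPartitioner does (illustrative stand-in, not its code).
    return hash(key) % num_reduces

print(partition("apple", R), partition("banana", R))
print(math.ceil(R / r))  # 2 rounds of reduce processes
```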
* The copy operation in the reduce phase overlaps the map phase: when a map's result is ready, the copy may start immediately.