extended programming model that supports iterative MapReduce computations efficiently [48]. It uses a publish/subscribe messaging infrastructure for communication and data transfers and supports long-running map/reduce tasks. In particular, it provides programming extensions to MapReduce with broadcast and scatter-type data transfers. Microsoft has also developed a project that provides an iterative MapReduce runtime for Windows Azure called Daytona.*
2.3.3 Data and Process Sharing
With the emergence of cloud computing, the use of an analytical query-processing infrastructure (e.g., Amazon EC2) can be directly mapped to monetary value. Because different MapReduce jobs can perform similar work, there are many opportunities to share the execution of that work. Such sharing reduces the overall amount of work performed and, consequently, the monetary charges incurred while using the resources of the processing infrastructure. The MRShare system [107] has been presented as a sharing framework that transforms a batch of queries into a new batch that executes more efficiently, by merging jobs into groups and evaluating each group as a single query. Based on a defined cost model, the authors describe an optimization problem whose goal is to derive the optimal grouping of queries so that redundant work is avoided, yielding significant savings in both processing time and money. In particular, the approach exploits the following sharing opportunities:
Sharing Scans. To share scans between two mapping pipelines Mi and Mj, the input data must be the same and the key/value pairs must be of the same type. Given that, the two pipelines can be merged into a single pipeline that scans the input data only once. Note, however, that the combined mapping produces two streams of output tuples, one for each of Mi and Mj. To distinguish the streams at the reducer stage, each tuple is tagged with a tag() part that indicates its originating mapping pipeline during the reduce phase.
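The merged scan can be pictured with a small, self-contained sketch. The pipelines map_i and map_j below, the tag values, and the driver loop are illustrative assumptions rather than MRShare's actual code; the point is only that each input record is read once while every output tuple carries a tag naming its origin pipeline.

# Sketch of scan sharing between two hypothetical map pipelines, Mi and Mj.
# Assumed for illustration: map_i does a word count, map_j counts record
# lengths; "tag_i"/"tag_j" stand in for MRShare's tag() part.

def map_i(line):
    # Pipeline i: emit (word, 1) for each word in the record.
    for word in line.split():
        yield (word, 1)

def map_j(line):
    # Pipeline j: emit (record length, 1) for the record.
    yield (len(line), 1)

def merged_map(line):
    # Run both pipelines over the record read by a single scan and tag
    # each output tuple with its origin so reducers can separate streams.
    for key, value in map_i(line):
        yield (key, value, "tag_i")
    for key, value in map_j(line):
        yield (key, value, "tag_j")

if __name__ == "__main__":
    records = ["the quick brown fox", "jumps over the lazy dog"]
    for record in records:          # the shared scan: each record is read once
        for tagged_tuple in merged_map(record):
            print(tagged_tuple)

In a real Hadoop job the tag would typically be folded into the intermediate key so that partitioning and sorting keep the two streams separable at the reducers.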
Sharing Map Output. If the map output key and value types are the same for two mapping pipelines Mi and Mj, then the map output streams of Mi and Mj can be shared. In particular, if Mapi and Mapj are applied to each input tuple, then a map output tuple produced only by Mapi is tagged with tag(i) only. If a map output tuple is produced from an input tuple by both Mapi and Mapj, it is tagged with tag(i)+tag(j). Therefore, any overlapping parts of the map output are shared. In principle, producing a smaller map output leads to savings in sorting and in copying intermediate data over the network.
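A minimal sketch of this overlap tagging follows, again under illustrative assumptions: the two filter-style pipelines and the tag encoding are invented here, not taken from MRShare. A tuple emitted by both pipelines appears once in the merged output with both tags, so the shared portion is sorted and shuffled only once.

# Sketch of map-output sharing: tuples produced by both Mapi and Mapj are
# emitted once with both tags, shrinking the intermediate data.

def map_i(record):
    # Hypothetical pipeline i: words longer than three characters.
    return {(w, 1) for w in record.split() if len(w) > 3}

def map_j(record):
    # Hypothetical pipeline j: words starting with a vowel.
    return {(w, 1) for w in record.split() if w[:1] in "aeiou"}

def merged_map(record):
    out_i, out_j = map_i(record), map_j(record)
    for kv in out_i & out_j:        # produced by both pipelines: shared, tag(i)+tag(j)
        yield (kv, ("tag_i", "tag_j"))
    for kv in out_i - out_j:        # produced only by Mapi
        yield (kv, ("tag_i",))
    for kv in out_j - out_i:        # produced only by Mapj
        yield (kv, ("tag_j",))

if __name__ == "__main__":
    for tagged_tuple in merged_map("every output tuple is tagged once"):
        print(tagged_tuple)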
Sharing Map Functions. Sometimes the map functions are identical and thus can be executed only once. At the end of the map stage, two streams are produced, each tagged with its job tag. If the map output is
* http://research.microsoft.com/en-us/projects/daytona/.