extended programming model that supports iterative MapReduce computations efficiently [48]. It uses a publish/subscribe messaging infrastructure for communication and data transfers and supports long-running map/reduce tasks. In particular, it provides programming extensions to MapReduce with broadcast and scatter-type data transfers. Microsoft has also developed a project that provides an iterative MapReduce runtime for Windows Azure called Daytona.*
2.3.3 Data and Process Sharing
With the emergence of cloud computing, the use of an analytical query-processing infrastructure (e.g., Amazon EC2) can be directly mapped to monetary value. Because different MapReduce jobs can perform similar work, there are many opportunities to share the execution of that work. Such sharing reduces the overall amount of work performed and, consequently, the monetary charges incurred while using the resources of the processing infrastructure. The MRShare system [107] has been presented as a sharing framework that transforms a batch of queries into a new batch that executes more efficiently, by merging jobs into groups and evaluating each group as a single query. Based on a defined cost model, the authors describe an optimization problem whose goal is to derive the optimal grouping of queries so that redundant work is avoided, yielding significant savings in both processing time and money. In particular, the approach exploits the following sharing opportunities:
Sharing Scans. To share scans between two mapping pipelines Mi and Mj, the input data must be the same and the key/value pairs must be of the same type. Given that, the two pipelines can be merged into a single pipeline that scans the input data only once. Note, however, that the combined mapping produces two streams of output tuples, one for each of Mi and Mj. To distinguish the streams at the reducer stage, each tuple is tagged with a tag() part that indicates its originating mapping pipeline during the reduce phase.
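The merged scan can be pictured with a small, self-contained sketch. The pipelines map_i and map_j below, the tag values, and the driver loop are illustrative assumptions rather than MRShare's actual code; the point is only that each input record is read once while every output tuple carries a tag naming its origin pipeline.

# Sketch of scan sharing between two hypothetical map pipelines, Mi and Mj.
# Assumed for illustration: map_i does a word count, map_j counts record
# lengths; "tag_i"/"tag_j" stand in for MRShare's tag() part.

def map_i(line):
    # Pipeline i: emit (word, 1) for each word in the record.
    for word in line.split():
        yield (word, 1)

def map_j(line):
    # Pipeline j: emit (record length, 1) for the record.
    yield (len(line), 1)

def merged_map(line):
    # Run both pipelines over the record read by a single scan and tag
    # each output tuple with its origin so reducers can separate streams.
    for key, value in map_i(line):
        yield (key, value, "tag_i")
    for key, value in map_j(line):
        yield (key, value, "tag_j")

if __name__ == "__main__":
    records = ["the quick brown fox", "jumps over the lazy dog"]
    for record in records:          # the shared scan: each record is read once
        for tagged_tuple in merged_map(record):
            print(tagged_tuple)

In a real Hadoop job the tag would typically be folded into the intermediate key so that partitioning and sorting keep the two streams separable at the reducers.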
Sharing Map Output. If the map output key and value types are the same for two mapping pipelines Mi and Mj, then the map output streams of Mi and Mj can be shared. In particular, if Mapi and Mapj are applied to each input tuple, then a map output tuple produced only by Mapi is tagged with tag(i) only. If a map output tuple is produced from an input tuple by both Mapi and Mapj, it is tagged with tag(i)+tag(j). Therefore, any overlapping parts of the map output are shared. In principle, producing a smaller map output leads to savings in sorting and in copying intermediate data over the network.
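A minimal sketch of this overlap tagging follows, again under illustrative assumptions: the two filter-style pipelines and the tag encoding are invented here, not taken from MRShare. A tuple emitted by both pipelines appears once in the merged output with both tags, so the shared portion is sorted and shuffled only once.

# Sketch of map-output sharing: tuples produced by both Mapi and Mapj are
# emitted once with both tags, shrinking the intermediate data.

def map_i(record):
    # Hypothetical pipeline i: words longer than three characters.
    return {(w, 1) for w in record.split() if len(w) > 3}

def map_j(record):
    # Hypothetical pipeline j: words starting with a vowel.
    return {(w, 1) for w in record.split() if w[:1] in "aeiou"}

def merged_map(record):
    out_i, out_j = map_i(record), map_j(record)
    for kv in out_i & out_j:        # produced by both pipelines: shared, tag(i)+tag(j)
        yield (kv, ("tag_i", "tag_j"))
    for kv in out_i - out_j:        # produced only by Mapi
        yield (kv, ("tag_i",))
    for kv in out_j - out_i:        # produced only by Mapj
        yield (kv, ("tag_j",))

if __name__ == "__main__":
    for tagged_tuple in merged_map("every output tuple is tagged once"):
        print(tagged_tuple)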
Sharing Map Functions. Sometimes the map functions are identical and thus can be executed only once. At the end of the map stage, two streams are produced, each tagged with its job tag. If the map output is
* http://research.microsoft.com/en-us/projects/daytona/.