reduction of the monetary charges incurred while utilizing the resources of the
processing infrastructure. The MRShare system [187] has been presented as a
sharing framework tailored to transform a batch of queries into a new batch
that can be executed more efficiently by merging jobs into groups and evaluating
each group as a single query. Based on a defined cost model, the authors describe an
optimization problem that aims to derive the optimal grouping of queries in order
to avoid redundant work, resulting in significant savings in
both processing time and money. In particular, the approach considers exploiting
the following sharing opportunities:
• Sharing scans. To share scans between two mapping pipelines M_i and M_j, the
input data must be the same. In addition, the key/value pairs should be of the
same type. Given that, it becomes possible to merge the two pipelines into a
single pipeline and scan the input data only once. Note, however,
that such a combined mapping produces two streams of output tuples (one for
each mapping pipeline M_i and M_j). To distinguish the streams at the
reducer stage, each tuple is tagged with a tag() part, which
indicates the originating mapping pipeline during the reduce phase.
• Sharing map output. If the map output key and value types are the same for two
mapping pipelines M_i and M_j, then the map output streams for M_i and M_j can
be shared. In particular, if Map_i and Map_j are applied to each input tuple, then
the map output tuples coming only from Map_i are tagged with tag(i) only. If
a map output tuple was produced from an input tuple by both Map_i and Map_j,
it is tagged with tag(i)+tag(j). Therefore, any overlapping parts of the
map output are shared. In principle, producing a smaller map output leads to
savings on sorting and copying intermediate data over the network.
• Sharing map functions. Sometimes the map functions are identical and can thus
be executed once. At the end of the map stage, two streams are produced,
each tagged with its job tag. If the map output is also shared, then clearly
only one stream needs to be generated. Even if only some filters are common to
both jobs, it is possible to share parts of the map functions.
In practice, sharing scans and sharing map output yield I/O savings, while sharing
map functions (or parts of them) yields additional CPU savings.
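The tagging scheme described above can be illustrated with a small in-memory simulation. This is only a minimal sketch under stated assumptions, not MRShare's actual implementation: the function names merged_map and shared_reduce, the representation of a tag as a frozenset of job identifiers (standing in for tag(i)+tag(j)), and the sample purchase records are all hypothetical. It shows a single scan feeding two map functions, overlapping map output being emitted once with a combined tag, and the reduce side using the tags to route values back to the right job.

```python
from collections import Counter, defaultdict


def merged_map(records, map_fns):
    """Scan the input once and apply every job's map function to each record.

    Yields (key, value, tags), where tags is the set of job ids that produced
    that (key, value) pair from the current record. The frozenset plays the
    role of tag(i)+tag(j); this representation is an assumption of the sketch.
    """
    for record in records:
        # For this record: (key, value) -> how many times each job emitted it.
        pairs = {}
        for job_id, map_fn in map_fns.items():
            for kv in map_fn(record):
                pairs.setdefault(kv, Counter())[job_id] += 1
        for (key, value), per_job in pairs.items():
            # Emit one shared tuple per "round" so that duplicates emitted by
            # a single job are preserved rather than collapsed.
            while per_job:
                yield key, value, frozenset(per_job)
                per_job -= Counter(dict.fromkeys(per_job, 1))


def shared_reduce(tagged_tuples, reduce_fns):
    """Group by key, then hand each job only the values tagged for it."""
    groups = defaultdict(list)
    for key, value, tags in tagged_tuples:
        groups[key].append((value, tags))
    results = {job_id: {} for job_id in reduce_fns}
    for key, tagged_values in groups.items():
        for job_id, reduce_fn in reduce_fns.items():
            values = [v for v, tags in tagged_values if job_id in tags]
            if values:
                results[job_id][key] = reduce_fn(key, values)
    return results


# Hypothetical jobs over (user, amount) purchase records.
def map_total(record):
    user, amount = record
    yield user, amount  # job "total": every purchase counts


def map_large(record):
    user, amount = record
    if amount >= 100:  # job "large": only purchases of 100 or more
        yield user, amount


def sum_values(key, values):
    return sum(values)


purchases = [("ann", 50), ("bob", 120), ("ann", 200)]
map_fns = {"total": map_total, "large": map_large}
reduce_fns = {"total": sum_values, "large": sum_values}

tagged = list(merged_map(purchases, map_fns))
results = shared_reduce(tagged, reduce_fns)
print(len(tagged))  # shared map output size; separate jobs would emit 5 tuples
print(results)
```

In this example the two jobs run separately would emit five map output tuples in total, whereas the merged pipeline emits three, two of which carry both tags. That reduction in intermediate data is where the sorting and network savings mentioned above would come from.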
While the MRShare system focuses on sharing the processing between queries
that are executed concurrently, the ReStore system [126, 127] has been introduced
to enable queries that are submitted at different times to share
the intermediate results of previously executed jobs and reuse them for jobs
submitted to the system in the future. In particular, each MapReduce job produces output
that is stored in the distributed file system used by the MapReduce system (e.g.,
HDFS). These intermediate results are kept (for a defined period) and managed so
that they can be used as input by subsequent jobs. ReStore can exploit reuse
opportunities for whole jobs or sub-jobs. To achieve this goal, ReStore consists of two
main components: