Big Data Processing Systems - Cloud Data Management


reduction of the monetary charges incurred while utilizing the resources of the

processing infrastructure. The MRShare system [187] has been presented as a

sharing framework which is tailored to transform a batch of queries into a new batch

that will be executed more efficiently by merging jobs into groups and evaluating

each group as a single query. Based on a defined cost model, its authors describe an

optimization problem that aims to derive the optimal grouping of queries in order

to avoid redundant work and thus achieve significant savings in

both processing time and money. In particular, the approach considers exploiting

the following sharing opportunities:

Sharing scans. To share scans between two mapping pipelines M_i and M_j, the

input data must be the same. In addition, the key/value pairs should be of the

same type. Given that, it becomes possible to merge the two pipelines into a

single pipeline and scan the input data only once. However, it should be noted

that such combined mapping will produce two streams of output tuples (one for

each mapping pipeline M_i and M_j). In order to distinguish the streams at the

reducer stage, each tuple is tagged with a tag() part. This tagging part is used

to indicate the origin mapping pipeline during the reduce phase.

Sharing map output. If the map output key and value types are the same for two

mapping pipelines M_i and M_j, then the map output streams for M_i and M_j can

be shared. In particular, if Map_i and Map_j are applied to each input tuple, then

the map output tuples coming only from Map_i are tagged with tag(i) only. If

a map output tuple was produced from an input tuple by both Map_i and Map_j,

it is then tagged by tag(i)+tag(j). Therefore, any overlapping parts of the

map output will be shared. In principle, producing a smaller map output leads to

savings on sorting and copying intermediate data over the network.

Sharing map functions. Sometimes the map functions are identical and thus they

can be executed once. At the end of the map stage, two streams are produced

where each is tagged with its job tag. If the map output is shared, then clearly

only one stream needs to be generated. Even if only some filters are common in

both jobs, it is possible to share parts of the map functions.

In practice, sharing scans and sharing map output yield I/O savings, while sharing

map functions (or parts of them) would yield additional CPU savings.
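The sharing opportunities above can be illustrated with a small in-memory sketch. It is not MRShare's implementation: the function names shared_map and shared_reduce, the job identifiers "i" and "j", and the sample records are all hypothetical, and a Python frozenset stands in for the tag(i)+tag(j) notation. The sketch assumes both jobs read the same input and emit key/value pairs of the same type.

```python
from collections import defaultdict


def shared_map(records, map_fns):
    """Scan `records` once, applying every job's map function to each record.

    A (key, value) pair produced by several jobs from the same input record
    is emitted only once, tagged with the set of jobs that produced it.
    """
    for record in records:  # single scan of the shared input
        tags_by_pair = defaultdict(set)
        for job_id, map_fn in map_fns.items():
            for pair in map_fn(record):
                tags_by_pair[pair].add(job_id)
        for (key, value), tags in tags_by_pair.items():
            yield key, value, frozenset(tags)


def shared_reduce(tagged_pairs, reduce_fns):
    """Route each tagged pair to the jobs named in its tag, then reduce per job."""
    grouped = defaultdict(lambda: defaultdict(list))
    for key, value, tags in tagged_pairs:
        for job_id in tags:  # the tag identifies the originating pipeline(s)
            grouped[job_id][key].append(value)
    return {
        job_id: {key: reduce_fns[job_id](key, values) for key, values in groups.items()}
        for job_id, groups in grouped.items()
    }


# Hypothetical jobs: job "i" keeps every record, job "j" keeps values above 4.
def map_i(record):
    yield record


def map_j(record):
    key, value = record
    if value > 4:
        yield (key, value)


records = [("a", 1), ("b", 5), ("a", 7)]

tagged = list(shared_map(records, {"i": map_i, "j": map_j}))
results = shared_reduce(
    tagged,
    {"i": lambda key, values: sum(values), "j": lambda key, values: max(values)},
)

for row in tagged:
    print(row)
print(results)
```

In this toy run the merged pipeline emits three tagged tuples where two independent jobs would have emitted five between them, which is the kind of reduction in sorted and copied intermediate data that the text describes. Sharing parts of the map functions themselves (for example, a common filter evaluated once) is not modelled here.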

While the MRShare system focuses on sharing the processing between queries

that are executed concurrently, the ReStore system [126, 127] has been introduced

so that queries submitted at different times can share

the intermediate results of previously executed jobs and reuse them for jobs

submitted to the system later. In particular, each MapReduce job produces output

that is stored in the distributed file system used by the MapReduce system (e.g.

HDFS). These intermediate results are kept (for a defined period) and managed so

that they can be used as input by subsequent jobs. ReStore can exploit reuse

opportunities for whole jobs or sub-jobs. To achieve this goal, ReStore consists of two

main components:
