shared, then clearly only one stream needs to be generated. Even if only
some filters are common to both jobs, it is possible to share parts of the map
functions.
In practice, sharing scans and sharing map output yield I/O savings, while
sharing map functions (or parts of them) yields additional CPU savings.
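As an illustration, the following minimal Python sketch (a simplification, not MRShare's actual implementation; the record fields and map functions are hypothetical) reads the input once, applies the map functions of two jobs to every record, and tags each emitted pair with its job ID so that the shared map output can be separated again on the reduce side:

# Minimal sketch of a shared scan: two jobs' map functions run over a
# single pass of the input, and each emitted pair is tagged with its job
# ID so the reduce side can split the shared map output per job.
from collections import defaultdict

def map_job1(record):
    # hypothetical filter/projection for job 1
    if record["price"] > 100:
        yield (record["category"], record["price"])

def map_job2(record):
    # hypothetical filter/projection for job 2
    yield (record["category"], 1)

def shared_scan(records, tagged_mappers):
    """Read the input once and fan each record out to every job's mapper."""
    for record in records:
        for job_id, mapper in tagged_mappers:
            for key, value in mapper(record):
                yield (job_id, key), value   # tag map output with its job

def group_by_job(tagged_pairs):
    """Split the shared map output back into per-job groups for reducing."""
    per_job = defaultdict(lambda: defaultdict(list))
    for (job_id, key), value in tagged_pairs:
        per_job[job_id][key].append(value)
    return per_job

records = [{"category": "books", "price": 120},
           {"category": "toys", "price": 30}]
groups = group_by_job(shared_scan(records, [("j1", map_job1), ("j2", map_job2)]))
print(dict(groups["j1"]), dict(groups["j2"]))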
While the MRShare system focuses on sharing processing between queries that
are executed concurrently, the ReStore system [49,50] has been introduced to
enable queries submitted at different times to share the intermediate results of
previously executed jobs and to reuse them for jobs submitted to the system in
the future. In particular, each MapReduce job produces output that is stored in
the distributed file system used by the MapReduce system (e.g., HDFS). These
intermediate results are kept (for a defined period) and managed so that they
can be used as input by subsequent jobs. ReStore can exploit reuse opportunities
at the level of whole jobs or sub-jobs. To achieve this goal, ReStore consists of
two main components (a sketch of both follows the list):
Repository of MapReduce job outputs: It stores the outputs of previously
executed MapReduce jobs together with the physical plans of these jobs.
Plan matcher and rewriter: Its aim is to find physical plans in the reposi-
tory that can be used to rewrite the input jobs using the available matching
intermediate results.
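The following minimal Python sketch illustrates how these two components could interact; the class and method names are illustrative and do not reflect ReStore's actual API. The repository keys stored outputs by a canonical plan signature, and the matcher rewrites an incoming job by replacing its longest stored plan prefix with a scan of the materialized output:

# Minimal sketch (names are illustrative, not ReStore's actual API) of the
# two components: a repository keyed by a canonical plan signature, and a
# matcher that rewrites a job to read a stored output instead of recomputing.
class OutputRepository:
    def __init__(self):
        self.stored = {}  # plan signature -> path of materialized output

    def register(self, plan, output_path):
        self.stored[self.signature(plan)] = output_path

    @staticmethod
    def signature(plan):
        # canonical form of a physical plan: here, a tuple of operator names
        return tuple(op["name"] for op in plan)

class PlanMatcherRewriter:
    def __init__(self, repo):
        self.repo = repo

    def rewrite(self, plan):
        # try the longest stored prefix first (whole-job, then sub-job reuse)
        for cut in range(len(plan), 0, -1):
            sig = self.repo.signature(plan[:cut])
            if sig in self.repo.stored:
                reused = {"name": "scan", "input": self.repo.stored[sig]}
                return [reused] + plan[cut:]   # replace prefix with a scan
        return plan  # no reuse opportunity found

repo = OutputRepository()
repo.register([{"name": "scan"}, {"name": "filter"}], "/hdfs/out/job42")
new_plan = PlanMatcherRewriter(repo).rewrite(
    [{"name": "scan"}, {"name": "filter"}, {"name": "aggregate"}])
print(new_plan)  # the scan+filter prefix now reads /hdfs/out/job42 directly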
In principle, the approach of the ReStore system can be viewed as analogous to
building and using materialized views in relational databases [62].
2.3.4 Support of Data Indices and Column Storage
One of the main limitations of the original implementation of the MapReduce
framework is that it is designed so that jobs can only scan the input data
sequentially. Hence, the query processing performance of the MapReduce
framework is unable to match the performance of a well-configured parallel
DBMS [113]. To tackle this challenge, Dittrich et al. [47] have presented
the Hadoop++ system, which aims to boost the query performance of the Hadoop
system without changing any of the system internals. They achieve this goal by
injecting their changes through user-defined functions (UDFs), which affect the
Hadoop system only from the inside, without any external effect. In particular,
they introduce the following main changes:
Trojan index: The original Hadoop implementation does not provide index
access due to the lack of a priori knowledge of the schema and of the MapReduce
jobs being executed. Hence, the Hadoop++ system is based on the assumption
that if we know the schema and the anticipated MapReduce jobs, then we can
create appropriate indices for the Hadoop tasks. In particular, the Trojan
index is an approach to integrate indexing capability into Hadoop in a non-
invasive way. These indices are created at data-loading time and
thus impose no penalty at query time. Each Trojan index provides an optional
index access path that can be used for selective MapReduce jobs; a sketch of
the idea follows.
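The following minimal Python sketch conveys the underlying idea rather than Hadoop++'s actual on-disk format: at load time each split is sorted on the index key and a small key directory is built alongside it; at query time, a binary search over the directory replaces a full scan:

# Minimal sketch (assumed record layout, not Hadoop++'s on-disk format) of
# the Trojan index idea: at load time, sort each split on the index key and
# build a small directory; at query time, binary search replaces a full scan.
import bisect

def load_split(records, key):
    """Load-time work: sort the split and build a key directory (the 'index')."""
    data = sorted(records, key=lambda r: r[key])
    keys = [r[key] for r in data]        # directory co-located with the split
    return data, keys

def index_scan(split, lo, hi):
    """Query-time work: use the directory to read only the qualifying range."""
    data, keys = split
    start = bisect.bisect_left(keys, lo)
    end = bisect.bisect_right(keys, hi)
    return data[start:end]               # selective access, no full scan

split = load_split([{"id": 7}, {"id": 2}, {"id": 5}], key="id")
print(index_scan(split, lo=3, hi=6))     # -> [{'id': 5}]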