Big Data Processing Systems - Cloud Data Management

Repository of MapReduce job outputs: It stores the outputs of previously executed MapReduce jobs and the physical plans of these jobs.

Plan matcher and rewriter: It searches the repository for physical plans that match parts of the input jobs, and rewrites those jobs to reuse the corresponding stored intermediate results.

In principle, the approach of the ReStore system can be viewed as analogous to the steps of building and using materialized views for relational databases [145].
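The interplay between the repository and the plan matcher and rewriter can be illustrated with a minimal sketch. It is not ReStore's implementation: modeling a physical plan as a tuple of operator strings, matching only plan prefixes, and the names `Repository`, `longest_matching_prefix`, `rewrite`, and the path `/repo/out_001` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Hypothetical representation: a physical plan is a tuple of operator strings.
Plan = Tuple[str, ...]


@dataclass
class Repository:
    """Maps the physical plan of an executed job to its stored output path."""
    outputs: Dict[Plan, str] = field(default_factory=dict)

    def store(self, plan: Plan, output_path: str) -> None:
        # Remember where the output of this plan was materialized.
        self.outputs[plan] = output_path

    def longest_matching_prefix(self, plan: Plan) -> Optional[Plan]:
        # Try the longest prefix first so that the most work is reused.
        for end in range(len(plan), 0, -1):
            if plan[:end] in self.outputs:
                return plan[:end]
        return None


def rewrite(plan: Plan, repo: Repository) -> Plan:
    """Replace a matched plan prefix with a load of its stored output."""
    matched = repo.longest_matching_prefix(plan)
    if matched is None:
        return plan  # nothing reusable; run the job unchanged
    return (f"LOAD {repo.outputs[matched]}",) + plan[len(matched):]


repo = Repository()
# An earlier job left its filtered output in the repository.
repo.store(("LOAD logs", "FILTER status=500"), "/repo/out_001")

# A new job shares the first two operators with the stored plan.
job = ("LOAD logs", "FILTER status=500", "GROUP BY host", "COUNT")
rewritten = rewrite(job, repo)
unmatched = rewrite(("LOAD clicks", "COUNT"), repo)
```

In this sketch the rewritten job starts from the stored intermediate result instead of recomputing the filter, much as a query optimizer answers a query from a materialized view.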

Support of Data Indices and Column Storage

One of the main limitations of the original implementation of the MapReduce framework is that jobs can only scan the input data sequentially. Hence, the query processing performance of the MapReduce framework is unable to match the performance of a well-configured parallel DBMS [194]. To tackle this challenge, Dittrich et al. [123] have presented the Hadoop++ system, which aims to boost the query performance of the Hadoop system without changing any of the system internals. They achieved this goal by injecting their changes through user-defined functions (UDFs), which affect the Hadoop system only from the inside, without any externally visible effect. In particular, they introduce the following main changes:

Trojan index: The original Hadoop implementation does not provide index access because it lacks a priori knowledge of the schema and of the MapReduce jobs to be executed. Hence, the Hadoop++ system is based on the assumption that if we know the schema and the anticipated MapReduce jobs, then we can create appropriate indices for the Hadoop tasks. In particular, a trojan index is an approach to integrating indexing capability into Hadoop in a non-invasive way. These indices are created at data loading time and thus incur no penalty at query time. Each trojan index provides an optional index access path which can be used for selective MapReduce jobs. The scan access path can still be used for other MapReduce jobs. These indices are created by injecting appropriate UDFs inside the Hadoop implementation. Specifically, the main features of trojan indices can be summarized as follows:

- No external library or engine: Trojan indices integrate indexing capability natively into the Hadoop framework without imposing a distributed SQL query engine on top of it.

- Non-invasive: They do not change the existing Hadoop framework. The index structure is implemented by providing the right UDFs.

- Optional access path: They provide an optional index access path which can be used for selective MapReduce jobs. However, the scan access path can still be used for other MapReduce jobs.
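The difference between the two access paths can be sketched as follows. This is a conceptual illustration only, not Hadoop++ code and not a Hadoop API: the class `IndexedSplit`, the helpers `map_selective` and `map_full`, the `(key, payload)` record layout, and the use of an in-memory sorted key list as the index are all assumptions.

```python
from bisect import bisect_left, bisect_right
from typing import Iterator, List, Tuple

Record = Tuple[int, str]  # (key, payload); hypothetical schema


class IndexedSplit:
    """One data split that is sorted and indexed when the data is loaded."""

    def __init__(self, records: List[Record]) -> None:
        # Load-time work: sort by key and keep the sorted keys as the index,
        # so no index construction is needed at query time.
        self.records = sorted(records, key=lambda r: r[0])
        self.keys = [r[0] for r in self.records]

    def scan(self) -> Iterator[Record]:
        # Scan access path: read every record in the split.
        return iter(self.records)

    def index_lookup(self, low: int, high: int) -> Iterator[Record]:
        # Index access path: binary search to the range low <= key <= high.
        start = bisect_left(self.keys, low)
        end = bisect_right(self.keys, high)
        return iter(self.records[start:end])


def map_selective(split: IndexedSplit, low: int, high: int) -> List[str]:
    # A selective job opts in to the index and touches only matching records.
    return [payload for _, payload in split.index_lookup(low, high)]


def map_full(split: IndexedSplit) -> List[str]:
    # Any other job keeps using the ordinary scan.
    return [payload for _, payload in split.scan()]


split = IndexedSplit([(42, "c"), (7, "a"), (19, "b"), (88, "d")])
selective = map_selective(split, 10, 50)
everything = map_full(split)
```

Both paths read the same stored split, which mirrors the optional nature of the index: a job that does not benefit from it simply scans as before.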
