• Repository of MapReduce job outputs: It stores the outputs of previously
executed MapReduce jobs and the physical plans of these jobs.
• Plan matcher and rewriter: Its aim is to find physical plans in the repository that
can be used to rewrite the input jobs using the available matching intermediate
results.
In principle, the approach of the ReStore system is analogous to building and
using materialized views in relational databases [145].
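
The interplay of the two components can be illustrated with a minimal sketch. It
assumes, purely for illustration, that a physical plan is a linear tuple of
operator strings and that matching means finding the longest stored plan that is
a prefix of the input job. The names Repository, store, and rewrite are
hypothetical and are not taken from ReStore, whose matcher works on plan graphs
rather than flat tuples.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Hypothetical, simplified model: a physical plan is a linear tuple of
# operator descriptions. This is an assumption made for the sketch.
Plan = Tuple[str, ...]


@dataclass
class Repository:
    """Maps the physical plan of an executed job to where its output is stored."""
    outputs: Dict[Plan, str] = field(default_factory=dict)

    def store(self, plan: Plan, output_path: str) -> None:
        self.outputs[plan] = output_path


def rewrite(plan: Plan, repo: Repository) -> Plan:
    """Replace the longest stored prefix of `plan` with a load of its output."""
    for end in range(len(plan), 0, -1):
        path = repo.outputs.get(plan[:end])
        if path is not None:
            return (f"load({path})",) + plan[end:]
    return plan  # no stored plan matches: run the job unchanged


repo = Repository()
repo.store(("load(logs)", "filter(status=500)"), "/restore/out_1")

job = ("load(logs)", "filter(status=500)", "group(url)", "count")
rewritten = rewrite(job, repo)
print(rewritten)
```

As with a materialized view, the rewritten job skips the load and filter steps
and starts from the stored intermediate result instead.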
Support of Data Indices and Column Storage
One of the main limitations of the original implementation of the MapReduce
framework is that its jobs can only scan the input data sequentially. Hence,
the query processing performance of the MapReduce framework cannot match that
of a well-configured parallel DBMS [194]. To tackle this challenge, Dittrich
et al. [123] presented the Hadoop++ system, which aims to boost the query
performance of the Hadoop system without changing any of the system internals.
They achieved this goal by injecting their changes through user-defined
functions (UDFs), which affect the Hadoop system only from the inside and have
no external effect. In particular, they introduce the following main changes:
• Trojan index: The original Hadoop implementation does not provide index access
due to the lack of a priori knowledge of the schema and of the MapReduce jobs
being executed. Hence, the Hadoop++ system is based on the assumption
that if we know the schema and the anticipated MapReduce jobs, then we can
create appropriate indices for the Hadoop tasks. In particular, the trojan index
is an approach to integrate indexing capability into Hadoop in a non-invasive way.
These indices are created at data loading time and thus incur no penalty
at query time. Each trojan index provides an optional index access path which
can be used for selective MapReduce jobs. The scan access path can still be used
for other MapReduce jobs. These indices are created by injecting appropriate
UDFs inside the Hadoop implementation. Specifically, the main features of trojan
indices can be summarized as follows:
  - No external library or engine: Trojan indices integrate indexing capability
    natively into the Hadoop framework without imposing a distributed SQL-query
    engine on top of it.
  - Non-invasive: They do not change the existing Hadoop framework. The index
    structure is implemented by providing the right UDFs.
  - Optional access path: They provide an optional index access path which can
    be used for selective MapReduce jobs. However, the scan access path can still
    be used for other MapReduce jobs.
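
The idea of two coexisting access paths can be sketched in a few lines. The
sketch assumes, for illustration only, that each input split holds records with
an integer key, that the records are sorted once at load time, and that the
index is simply the sorted key list searched by bisection. The class
IndexedSplit and its methods are hypothetical and are not part of Hadoop or of
the system described above, whose indices are injected through UDFs.

```python
import bisect
from typing import List, Tuple

Record = Tuple[int, str]  # (key, payload); a hypothetical fixed schema


class IndexedSplit:
    """One input split whose records are sorted and indexed at load time."""

    def __init__(self, records: List[Record]) -> None:
        # Load-time cost: sort once and keep the keys as a lookup index.
        self.records = sorted(records)
        self.keys = [key for key, _ in self.records]

    def scan(self) -> List[Record]:
        """Scan access path: return every record in the split."""
        return list(self.records)

    def index_lookup(self, low: int, high: int) -> List[Record]:
        """Index access path: binary-search to records with low <= key <= high."""
        start = bisect.bisect_left(self.keys, low)
        end = bisect.bisect_right(self.keys, high)
        return self.records[start:end]


split = IndexedSplit([(42, "d"), (7, "a"), (19, "c"), (11, "b")])
selective = split.index_lookup(10, 20)  # a selective job uses the index
full = split.scan()                     # any other job still scans
print(selective)
```

The sorting cost is paid in the constructor, which mirrors the point that the
index is built during data loading rather than at query time.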