shared, then clearly only one stream needs to be generated. Even if only
some filters are common to both jobs, it is possible to share parts of the map
functions.
In practice, sharing scans and sharing map output yield I/O savings, while
sharing map functions (or parts of them) yields additional CPU savings.
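As an illustration, the following minimal Python sketch (a simplification, not MRShare's actual implementation; the record fields and map functions are hypothetical) reads the input once, applies the map functions of two jobs to every record, and tags each emitted pair with its job ID so that the shared map output can be separated again on the reduce side:

# Minimal sketch of a shared scan: two jobs' map functions run over a
# single pass of the input, and each emitted pair is tagged with its job
# ID so the reduce side can split the shared map output per job.
from collections import defaultdict

def map_job1(record):
    # hypothetical filter/projection for job 1
    if record["price"] > 100:
        yield (record["category"], record["price"])

def map_job2(record):
    # hypothetical filter/projection for job 2
    yield (record["category"], 1)

def shared_scan(records, tagged_mappers):
    """Read the input once and fan each record out to every job's mapper."""
    for record in records:
        for job_id, mapper in tagged_mappers:
            for key, value in mapper(record):
                yield (job_id, key), value   # tag map output with its job

def group_by_job(tagged_pairs):
    """Split the shared map output back into per-job groups for reducing."""
    per_job = defaultdict(lambda: defaultdict(list))
    for (job_id, key), value in tagged_pairs:
        per_job[job_id][key].append(value)
    return per_job

records = [{"category": "books", "price": 120},
           {"category": "toys", "price": 30}]
groups = group_by_job(shared_scan(records, [("j1", map_job1), ("j2", map_job2)]))
print(dict(groups["j1"]), dict(groups["j2"]))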
While the MRShare system focuses on sharing processing between queries that
are executed concurrently, the ReStore system [49,50] has been introduced to
enable queries submitted at different times to share the intermediate results of
previously executed jobs and to reuse them for jobs submitted to the system in
the future. In particular, each MapReduce job produces output that is stored in
the distributed file system used by the MapReduce system (e.g., HDFS). These
intermediate results are kept (for a defined period) and managed so that they
can be used as input by subsequent jobs. ReStore can exploit reuse opportunities
at the level of whole jobs or sub-jobs. To achieve this goal, ReStore consists of
two main components (a sketch of both follows the list):
Repository of MapReduce job outputs: It stores the outputs of previously
executed MapReduce jobs together with the physical plans of these jobs.
Plan matcher and rewriter: Its aim is to find physical plans in the reposi-
tory that can be used to rewrite the input jobs using the available matching
intermediate results.
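The following minimal Python sketch illustrates how these two components could interact; the class and method names are illustrative and do not reflect ReStore's actual API. The repository keys stored outputs by a canonical plan signature, and the matcher rewrites an incoming job by replacing its longest stored plan prefix with a scan of the materialized output:

# Minimal sketch (names are illustrative, not ReStore's actual API) of the
# two components: a repository keyed by a canonical plan signature, and a
# matcher that rewrites a job to read a stored output instead of recomputing.
class OutputRepository:
    def __init__(self):
        self.stored = {}  # plan signature -> path of materialized output

    def register(self, plan, output_path):
        self.stored[self.signature(plan)] = output_path

    @staticmethod
    def signature(plan):
        # canonical form of a physical plan: here, a tuple of operator names
        return tuple(op["name"] for op in plan)

class PlanMatcherRewriter:
    def __init__(self, repo):
        self.repo = repo

    def rewrite(self, plan):
        # try the longest stored prefix first (whole-job, then sub-job reuse)
        for cut in range(len(plan), 0, -1):
            sig = self.repo.signature(plan[:cut])
            if sig in self.repo.stored:
                reused = {"name": "scan", "input": self.repo.stored[sig]}
                return [reused] + plan[cut:]   # replace prefix with a scan
        return plan  # no reuse opportunity found

repo = OutputRepository()
repo.register([{"name": "scan"}, {"name": "filter"}], "/hdfs/out/job42")
new_plan = PlanMatcherRewriter(repo).rewrite(
    [{"name": "scan"}, {"name": "filter"}, {"name": "aggregate"}])
print(new_plan)  # the scan+filter prefix now reads /hdfs/out/job42 directly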
In principle, the approach of the ReStore system can be viewed as analogous to
building and using materialized views in relational databases [62].
2.3.4 Support of Data Indices and Column Storage
One of the main limitations of the original implementation of the MapReduce
framework is that it is designed so that jobs can only scan the input data
sequentially. Hence, the query processing performance of the MapReduce
framework is unable to match the performance of a well-configured parallel
DBMS [113]. To tackle this challenge, Dittrich et al. [47] have presented
the Hadoop++ system, which aims to boost the query performance of the Hadoop
system without changing any of the system internals. They achieve this goal by
injecting their changes through user-defined functions (UDFs), which affect the
Hadoop system only from the inside, without any external effect. In particular,
they introduce the following main changes:
Trojan index: The original Hadoop implementation does not provide index
access due to the lack of a priori knowledge of the schema and of the MapReduce
jobs being executed. Hence, the Hadoop++ system is based on the assumption
that if we know the schema and the anticipated MapReduce jobs, then we can
create appropriate indices for the Hadoop tasks. In particular, the Trojan
index is an approach to integrate indexing capability into Hadoop in a non-
invasive way. These indices are created at data-loading time and
thus impose no penalty at query time. Each Trojan index provides an optional
index access path that can be used for selective MapReduce jobs; a sketch of
the idea follows.
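The following minimal Python sketch conveys the underlying idea rather than Hadoop++'s actual on-disk format: at load time each split is sorted on the index key and a small key directory is built alongside it; at query time, a binary search over the directory replaces a full scan:

# Minimal sketch (assumed record layout, not Hadoop++'s on-disk format) of
# the Trojan index idea: at load time, sort each split on the index key and
# build a small directory; at query time, binary search replaces a full scan.
import bisect

def load_split(records, key):
    """Load-time work: sort the split and build a key directory (the 'index')."""
    data = sorted(records, key=lambda r: r[key])
    keys = [r[key] for r in data]        # directory co-located with the split
    return data, keys

def index_scan(split, lo, hi):
    """Query-time work: use the directory to read only the qualifying range."""
    data, keys = split
    start = bisect.bisect_left(keys, lo)
    end = bisect.bisect_right(keys, hi)
    return data[start:end]               # selective access, no full scan

split = load_split([{"id": 7}, {"id": 2}, {"id": 5}], key="id")
print(index_scan(split, lo=3, hi=6))     # -> [{'id': 5}]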