of MapReduce jobs. They also provide evaluation results for the SP²Bench queries
Q1, Q2, and Q3a on a Hadoop cluster of ten nodes similar to our cluster. A comparison
of the results confirms that both approaches perform similarly, although our
implementation is more than 40% faster for Q3a. This demonstrates that our approach
of mapping SPARQL to Pig Latin executes SPARQL queries at least as efficiently as
a direct mapping to MapReduce. Another direct mapping approach is also
proposed in [33]. In contrast to these approaches, our translation supports all SPARQL
1.0 operators and also benefits from further developments of Pig [34]. As we map to
Pig Latin, we can expect a greater independence from possible changes inside the
underlying MapReduce layer in comparison to a direct mapping.
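The flavor of such a translation can be illustrated with a small sketch. The Python function below is our own illustration, not the authors' actual translator: it compiles a basic graph pattern (a list of triple patterns) into Pig Latin statements, turning each pattern's constants into a FILTER and joining consecutive patterns on their first shared variable.

```python
# Illustrative sketch of compiling a SPARQL basic graph pattern to
# Pig Latin (hypothetical names; column qualification of joined
# relations is simplified).

def pattern_to_filter(name, pattern):
    """Compile one triple pattern into a Pig Latin FILTER statement.
    Constants become conditions; variables (starting with '?') stay free."""
    conditions = [
        f"{pos} == '{term}'"
        for pos, term in zip(("s", "p", "o"), pattern)
        if not term.startswith("?")
    ]
    return f"{name} = FILTER triples BY {' AND '.join(conditions)};"

def bgp_to_pig(patterns):
    """Compile a basic graph pattern into a Pig Latin script:
    one FILTER per pattern, then JOINs on shared variables."""
    lines = ["triples = LOAD 'rdf' USING PigStorage(' ') AS (s, p, o);"]
    names = []
    for i, pat in enumerate(patterns):
        name = f"t{i}"
        lines.append(pattern_to_filter(name, pat))
        names.append((name, pat))
    prev_name, prev_pat = names[0]
    for name, pat in names[1:]:
        # join on the position of the first variable shared with the
        # previous pattern (simplification: assumes one exists)
        shared = next(v for v in pat if v.startswith("?") and v in prev_pat)
        lpos = ("s", "p", "o")[prev_pat.index(shared)]
        rpos = ("s", "p", "o")[pat.index(shared)]
        joined = f"j_{prev_name}_{name}"
        lines.append(f"{joined} = JOIN {prev_name} BY {lpos}, {name} BY {rpos};")
        prev_name, prev_pat = joined, pat
    return "\n".join(lines)

script = bgp_to_pig([("?x", "rdf:type", "foaf:Person"),
                     ("?x", "foaf:name", "?n")])
print(script)
```

Because the output is ordinary Pig Latin, the generated script inherits any optimizations the Pig compiler applies when it is turned into MapReduce jobs.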
There is also a large body of work dealing with join processing in MapReduce
considering various aspects and application fields [14,15,35-39]. In Section 5.2.2, we
briefly outlined the advantages and drawbacks of the general-purpose reduce-side
and map-side (merge) join approaches in MapReduce. Though map-side joins are
generally more efficient, they are hard to cascade due to the strict preconditions.
Our MAPSIN approach leverages HBase to overcome the shortcomings of com-
mon map-side joins without the use of auxiliary shuffle and reduce phases, making
MAPSIN joins easily cascadable. In addition to these general-purpose approaches
there are several proposals focusing on certain join types or optimizations of existing
join techniques for particular application fields. In [37], the authors discuss how to
process arbitrary joins (theta joins) using MapReduce, whereas [35] focuses on
optimizing multiway joins. However, in contrast to our MAPSIN join, both approaches
process the join in the reduce phase including a costly data shuffle phase.
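The contrast can be made concrete with a toy sketch (our own illustration, not MAPSIN's implementation): the reduce-side join re-partitions both inputs by the join key, modeling the shuffle, while the map-side index join merely probes a lookup structure, with a dict standing in for HBase, so no shuffle or reduce phase is involved.

```python
# Toy contrast between a reduce-side join and a MAPSIN-style
# map-side index join (the dict stands in for HBase).
from collections import defaultdict

def reduce_side_join(left, right):
    """Reduce-side join: both inputs are re-partitioned ("shuffled")
    by join key, then combined per key in the reduce phase."""
    shuffle = defaultdict(lambda: ([], []))
    for k, v in left:
        shuffle[k][0].append(v)   # every record crosses the network
    for k, v in right:
        shuffle[k][1].append(v)
    out = []
    for k, (ls, rs) in shuffle.items():
        out.extend((k, l, r) for l in ls for r in rs)
    return out

def mapsin_join(left, index):
    """Map-side index join: each mapper streams its local partition of
    `left` and fetches matches with a random read against the index
    (HBase in MAPSIN); no shuffle or reduce phase is needed, so the
    output can feed the next join iteration directly."""
    return [(k, v, r) for k, v in left for r in index.get(k, [])]

left = [("s1", "a"), ("s2", "b")]
right = [("s1", "x"), ("s1", "y"), ("s3", "z")]
index = {"s1": ["x", "y"], "s3": ["z"]}

# both strategies produce the same join result
assert sorted(reduce_side_join(left, right)) == sorted(mapsin_join(left, index))
```

The difference in the sketch is where the data moves: the reduce-side variant ships every record through the shuffle, while the index variant only issues point lookups for the keys it actually sees.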
Map-Reduce-Merge [39] describes a modified MapReduce workflow by adding
a merge phase after the reduce phase, whereas Map-Join-Reduce [36] proposes
a join phase in between the map and reduce phases. Both techniques attempt to
improve the support for joins in MapReduce but require profound modifications to
the MapReduce framework. In [40], the authors present noninvasive index and join
techniques for SQL processing in MapReduce that also reduce the amount of shuffled
data, at the cost of an additional co-partitioning and indexing phase at load time.
However, the schema and workload are assumed to be known in advance, which is
typically feasible for relational data but does not hold for RDF in general.
HadoopDB [41] is a hybrid of MapReduce and DBMS where MapReduce is the
communication layer above multiple single-node DBMS. The authors in [3] adopt this
hybrid approach for the semantic web using RDF-3X. However, the initial graph parti-
tioning is done on a single machine and has to be repeated if the data set is updated or
the number of machines in the cluster changes. As we use HBase as the underlying
storage layer, additional machines can be plugged in seamlessly and updates are possible
without having to reload the entire data set. HadoopRDF [30] is a MapReduce-based
RDF system that stores data directly in HDFS and also does not require any changes to
the Hadoop framework. It can rebalance automatically when the cluster size changes,
but join processing is still done in the reduce phase. Our MAPSIN join does not use
any shuffle or reduce phase at all, even in consecutive iterations.
Instead of a general MapReduce cluster, some RDF stores are built on top of
a specialized computer cluster. Virtuoso Cluster Edition [42] is a cluster extension