of MapReduce jobs. They also provide evaluation results for the SP²Bench queries
Q1, Q2, and Q3a on a Hadoop cluster of ten nodes similar to our cluster. A comparison
of the results confirms that both approaches perform similarly, although our
implementation is more than 40% faster for Q3a. This demonstrates that our approach
of mapping SPARQL to Pig Latin executes SPARQL queries at least as efficiently as
a direct mapping to MapReduce. Another direct mapping approach is also
proposed in [33]. In contrast to these approaches, our translation supports all SPARQL
1.0 operators and also benefits from further developments of Pig [34]. As we map to
Pig Latin, we can expect a greater independence from possible changes inside the
underlying MapReduce layer in comparison to a direct mapping.
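The flavor of such a translation can be illustrated with a small sketch. The Python function below is our own illustration, not the authors' actual translator: it compiles a basic graph pattern (a list of triple patterns) into Pig Latin statements, turning each pattern's constants into a FILTER and joining consecutive patterns on their first shared variable.

```python
# Illustrative sketch of compiling a SPARQL basic graph pattern to
# Pig Latin (hypothetical names; column qualification of joined
# relations is simplified).

def pattern_to_filter(name, pattern):
    """Compile one triple pattern into a Pig Latin FILTER statement.
    Constants become conditions; variables (starting with '?') stay free."""
    conditions = [
        f"{pos} == '{term}'"
        for pos, term in zip(("s", "p", "o"), pattern)
        if not term.startswith("?")
    ]
    return f"{name} = FILTER triples BY {' AND '.join(conditions)};"

def bgp_to_pig(patterns):
    """Compile a basic graph pattern into a Pig Latin script:
    one FILTER per pattern, then JOINs on shared variables."""
    lines = ["triples = LOAD 'rdf' USING PigStorage(' ') AS (s, p, o);"]
    names = []
    for i, pat in enumerate(patterns):
        name = f"t{i}"
        lines.append(pattern_to_filter(name, pat))
        names.append((name, pat))
    prev_name, prev_pat = names[0]
    for name, pat in names[1:]:
        # join on the position of the first variable shared with the
        # previous pattern (simplification: assumes one exists)
        shared = next(v for v in pat if v.startswith("?") and v in prev_pat)
        lpos = ("s", "p", "o")[prev_pat.index(shared)]
        rpos = ("s", "p", "o")[pat.index(shared)]
        joined = f"j_{prev_name}_{name}"
        lines.append(f"{joined} = JOIN {prev_name} BY {lpos}, {name} BY {rpos};")
        prev_name, prev_pat = joined, pat
    return "\n".join(lines)

script = bgp_to_pig([("?x", "rdf:type", "foaf:Person"),
                     ("?x", "foaf:name", "?n")])
print(script)
```

Because the output is ordinary Pig Latin, the generated script inherits any optimizations the Pig compiler applies when it is turned into MapReduce jobs.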
There is also a large body of work dealing with join processing in MapReduce
considering various aspects and application fields [14,15,35-39]. In Section 5.2.2, we
briefly outlined the advantages and drawbacks of the general-purpose reduce-side
and map-side (merge) join approaches in MapReduce. Though map-side joins are
generally more efficient, they are hard to cascade due to the strict preconditions.
Our MAPSIN approach leverages HBase to overcome the shortcomings of com-
mon map-side joins without the use of auxiliary shuffle and reduce phases, making
MAPSIN joins easily cascadable. In addition to these general-purpose approaches
there are several proposals focusing on certain join types or optimizations of existing
join techniques for particular application fields. In [37], the authors discuss how to
process arbitrary joins (theta joins) using MapReduce, whereas [35] focuses on
optimizing multiway joins. However, in contrast to our MAPSIN join, both approaches
process the join in the reduce phase including a costly data shuffle phase.
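The contrast can be made concrete with a toy sketch (our own illustration, not MAPSIN's implementation): the reduce-side join re-partitions both inputs by the join key, modeling the shuffle, while the map-side index join merely probes a lookup structure, with a dict standing in for HBase, so no shuffle or reduce phase is involved.

```python
# Toy contrast between a reduce-side join and a MAPSIN-style
# map-side index join (the dict stands in for HBase).
from collections import defaultdict

def reduce_side_join(left, right):
    """Reduce-side join: both inputs are re-partitioned ("shuffled")
    by join key, then combined per key in the reduce phase."""
    shuffle = defaultdict(lambda: ([], []))
    for k, v in left:
        shuffle[k][0].append(v)   # every record crosses the network
    for k, v in right:
        shuffle[k][1].append(v)
    out = []
    for k, (ls, rs) in shuffle.items():
        out.extend((k, l, r) for l in ls for r in rs)
    return out

def mapsin_join(left, index):
    """Map-side index join: each mapper streams its local partition of
    `left` and fetches matches with a random read against the index
    (HBase in MAPSIN); no shuffle or reduce phase is needed, so the
    output can feed the next join iteration directly."""
    return [(k, v, r) for k, v in left for r in index.get(k, [])]

left = [("s1", "a"), ("s2", "b")]
right = [("s1", "x"), ("s1", "y"), ("s3", "z")]
index = {"s1": ["x", "y"], "s3": ["z"]}

# both strategies produce the same join result
assert sorted(reduce_side_join(left, right)) == sorted(mapsin_join(left, index))
```

The difference in the sketch is where the data moves: the reduce-side variant ships every record through the shuffle, while the index variant only issues point lookups for the keys it actually sees.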
Map-Reduce-Merge [39] describes a modified MapReduce workflow by adding
a merge phase after the reduce phase, whereas Map-Join-Reduce [36] proposes
a join phase in between the map and reduce phases. Both techniques attempt to
improve the support for joins in MapReduce but require profound modifications to
the MapReduce framework. In [40], the authors present noninvasive index and join
techniques for SQL processing in MapReduce that also reduce the amount of shuffled
data, at the cost of an additional co-partitioning and indexing phase at load time.
However, the schema and workload are assumed to be known in advance, which is
typically feasible for relational data but does not hold for RDF in general.
HadoopDB [41] is a hybrid of MapReduce and DBMS where MapReduce is the
communication layer above multiple single-node DBMS. The authors in [3] adopt this
hybrid approach for the semantic web using RDF-3X. However, the initial graph parti-
tioning is done on a single machine and has to be repeated if the data set is updated or
the number of machines in the cluster changes. As we use HBase as the underlying
storage layer, additional machines can be plugged in seamlessly and updates are possible
without having to reload the entire data set. HadoopRDF [30] is a MapReduce-based
RDF system that stores data directly in HDFS and also does not require any changes to
the Hadoop framework. It can rebalance automatically when the cluster size changes,
but join processing is still done in the reduce phase. Our MAPSIN join does not use
any shuffle or reduce phase at all, even in consecutive iterations.
Instead of a general MapReduce cluster, some RDF stores are built on top of
a specialized computer cluster. Virtuoso Cluster Edition [42] is a cluster extension