of the well-known Virtuoso RDF store. 4store [7] is a ready-to-use RDF store that
divides the cluster into storage and processing nodes. However, the use of a
specialized cluster has the disadvantage that it requires a dedicated infrastructure
and poses additional installation and management overhead, whereas our approach
builds upon the idea of using existing infrastructures that are well known and widely used.
As we do not require any changes to Hadoop or HBase at all, it is possible to use any
existing Hadoop cluster or cloud service (e.g., Amazon EC2) out of the box.
5.9 CONCLUSION
In this chapter, we presented PigSPARQL, a new approach for the scalable execu-
tion of SPARQL queries on very large RDF data sets. For this purpose, we designed
and implemented a translation from SPARQL to Pig Latin. The resulting Pig Latin
program is translated into a sequence of MapReduce jobs and executed in parallel
on a Hadoop cluster. Following such an approach, we benefit from further devel-
opments of Apache Pig without any additional programming effort. This includes
performance enhancements as well as major changes of Hadoop like the upcoming
YARN (MRv2) framework. PigSPARQL is available for download and can be used
out of the box on any Hadoop cluster with Apache Pig installed, as neither an
installation nor a configuration process is required. Our evaluation with a
SPARQL-specific benchmark confirmed that PigSPARQL is well suited for the scal-
able execution of SPARQL queries on large RDF data sets with Hadoop. This is also
demonstrated by the data set size used, up to 1.6 billion RDF triples, which already
exceeds the capabilities of many single-machine systems [10].
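The core of the translation described above can be illustrated with a toy sketch: each triple pattern of a basic graph pattern becomes a FILTER over the loaded triple relation, and two patterns sharing a variable are combined by a JOIN on the positions of that variable. The function names, aliases, and file paths below are illustrative assumptions, not PigSPARQL's actual implementation.

```python
# Toy sketch of a SPARQL-to-Pig-Latin translation for a two-pattern basic
# graph pattern. Assumes triples are stored as whitespace-separated
# (subject, predicate, object) lines and that each pattern binds at least
# one term (so the generated FILTER is never empty).

FIELDS = ("s", "p", "o")

def pattern_to_filter(alias, pattern):
    # Bound terms (those not starting with '?') become filter conditions.
    conds = [f"{f} == '{t}'"
             for f, t in zip(FIELDS, pattern) if not t.startswith("?")]
    return f"{alias} = FILTER triples BY {' AND '.join(conds)};"

def shared_var_positions(pat1, pat2):
    # Find the first variable occurring in both patterns and the triple
    # positions (s/p/o) it occupies in each.
    for i, t1 in enumerate(pat1):
        if t1.startswith("?"):
            for j, t2 in enumerate(pat2):
                if t1 == t2:
                    return FIELDS[i], FIELDS[j]
    raise ValueError("patterns do not share a variable")

def translate_bgp(pat1, pat2):
    f1, f2 = shared_var_positions(pat1, pat2)
    return "\n".join([
        "triples = LOAD 'rdf.nt' USING PigStorage(' ') AS (s, p, o);",
        pattern_to_filter("t1", pat1),
        pattern_to_filter("t2", pat2),
        f"j = JOIN t1 BY {f1}, t2 BY {f2};",
    ])

# BGP: ?x foaf:knows ?y . ?y foaf:age ?a  -- joined on ?y (o of t1, s of t2)
print(translate_bgp(("?x", "foaf:knows", "?y"),
                    ("?y", "foaf:age", "?a")))
```

Each such JOIN is compiled by Pig into a MapReduce job, which is where the reduce-side join cost discussed next arises.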
Although PigSPARQL offers an easy and efficient way to take advantage of the
performance and scalability of Hadoop for the distributed and parallelized execution
of SPARQL queries, the performance of selective queries was not satisfactory. This
is due to the principle of a reduce-side join, where dangling tuples are discarded
only in the reduce phase, meaning that a large amount of potentially unneeded data
has to be shuffled across the network. To overcome this issue, we
introduced the Map-Side Index Nested Loop Join (MAPSIN join), which combines
the advantages of the NoSQL data store HBase with the well-known and proven
distributed processing facilities of MapReduce. In general, map-side joins are more
efficient than reduce-side joins in MapReduce as there is no expensive data shuffle
phase involved. However, current map-side join approaches suffer from strict pre-
conditions, which makes them hard to apply in general, especially in a sequence of
joins. The combination of HBase and MapReduce allows us to cascade a sequence of
MAPSIN joins without having to sort and repartition the intermediate output for the
next iteration. Furthermore, with the multiway join optimization, we can reduce the
number of MapReduce iterations and HBase requests. Using an index to selectively
request only the data that is actually needed also saves network bandwidth, making
parallel query execution more efficient. The evaluation with the LUBM benchmark
demonstrated the advantages of our approach compared with the commonly used
reduce-side join approach. For selective queries, the MAPSIN join-based SPARQL
query execution outperformed the reduce-side join-based execution in PigSPARQL
by an order of magnitude while scaling very smoothly with the input size. Lastly,