of the well-known Virtuoso RDF store. 4store [7] is a ready-to-use RDF store that
divides the cluster into storage and processing nodes. However, the use of a
specialized cluster has the disadvantage that it requires a dedicated infrastructure
and poses additional installation and management overhead, whereas our approach
builds upon the idea of using existing infrastructures that are well known and widely used.
As we do not require any changes to Hadoop or HBase at all, it is possible to use any
existing Hadoop cluster or cloud service (e.g., Amazon EC2) out of the box.
5.9 CONCLUSION
In this chapter, we presented PigSPARQL, a new approach for the scalable execu-
tion of SPARQL queries on very large RDF data sets. For this purpose, we designed
and implemented a translation from SPARQL to Pig Latin. The resulting Pig Latin
program is translated into a sequence of MapReduce jobs and executed in parallel
on a Hadoop cluster. Following such an approach, we benefit from further devel-
opments of Apache Pig without any additional programming effort. This includes
performance enhancements as well as major changes of Hadoop like the upcoming
YARN (MRv2) framework. PigSPARQL is available for download and can be used
out of the box on any Hadoop cluster with Apache Pig installed, as neither an
installation nor a configuration process is required. Our evaluation with a
SPARQL-specific benchmark confirmed that PigSPARQL is well suited for the scal-
able execution of SPARQL queries on large RDF data sets with Hadoop. This is also
demonstrated by the data set size used, up to 1.6 billion RDF triples, which already
exceeds the capabilities of many single-machine systems [10].
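The core of the translation described above can be illustrated with a toy sketch: each triple pattern of a basic graph pattern becomes a FILTER over the loaded triple relation, and two patterns sharing a variable are combined by a JOIN on the positions of that variable. The function names, aliases, and file paths below are illustrative assumptions, not PigSPARQL's actual implementation.

```python
# Toy sketch of a SPARQL-to-Pig-Latin translation for a two-pattern basic
# graph pattern. Assumes triples are stored as whitespace-separated
# (subject, predicate, object) lines and that each pattern binds at least
# one term (so the generated FILTER is never empty).

FIELDS = ("s", "p", "o")

def pattern_to_filter(alias, pattern):
    # Bound terms (those not starting with '?') become filter conditions.
    conds = [f"{f} == '{t}'"
             for f, t in zip(FIELDS, pattern) if not t.startswith("?")]
    return f"{alias} = FILTER triples BY {' AND '.join(conds)};"

def shared_var_positions(pat1, pat2):
    # Find the first variable occurring in both patterns and the triple
    # positions (s/p/o) it occupies in each.
    for i, t1 in enumerate(pat1):
        if t1.startswith("?"):
            for j, t2 in enumerate(pat2):
                if t1 == t2:
                    return FIELDS[i], FIELDS[j]
    raise ValueError("patterns do not share a variable")

def translate_bgp(pat1, pat2):
    f1, f2 = shared_var_positions(pat1, pat2)
    return "\n".join([
        "triples = LOAD 'rdf.nt' USING PigStorage(' ') AS (s, p, o);",
        pattern_to_filter("t1", pat1),
        pattern_to_filter("t2", pat2),
        f"j = JOIN t1 BY {f1}, t2 BY {f2};",
    ])

# BGP: ?x foaf:knows ?y . ?y foaf:age ?a  -- joined on ?y (o of t1, s of t2)
print(translate_bgp(("?x", "foaf:knows", "?y"),
                    ("?y", "foaf:age", "?a")))
```

Each such JOIN is compiled by Pig into a MapReduce job, which is where the reduce-side join cost discussed next arises.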
Although PigSPARQL offers an easy and efficient way to take advantage of the
performance and scalability of Hadoop for the distributed and parallelized execution
of SPARQL queries, the performance of selective queries was not satisfactory. This
is due to the principle of a reduce-side join, where dangling tuples are discarded
only in the reduce phase, meaning that a large amount of potentially unneeded data
has to be shuffled across the network. To overcome this issue, we
introduced the Map-Side Index Nested Loop Join (MAPSIN join), which combines
the advantages of the NoSQL data store HBase with the well-known and proven
distributed processing facilities of MapReduce. In general, map-side joins are more
efficient than reduce-side joins in MapReduce as there is no expensive data shuffle
phase involved. However, current map-side join approaches suffer from strict pre-
conditions, which makes them hard to apply in general, especially in a sequence of
joins. The combination of HBase and MapReduce allows us to cascade a sequence of
MAPSIN joins without having to sort and repartition the intermediate output for the
next iteration. Furthermore, with the multiway join optimization, we can reduce the
number of MapReduce iterations and HBase requests. Using an index to selectively
request only the data that is actually needed also saves network bandwidth, making
parallel query execution more efficient. The evaluation with the LUBM benchmark
demonstrated the advantages of our approach compared with the commonly used
reduce-side join approach. For selective queries, the MAPSIN join-based SPARQL
query execution outperformed the reduce-side join-based execution in PigSPARQL
by an order of magnitude while scaling very smoothly with the input size. Lastly,