Algebraic Optimization of RDF Graph Pattern Queries on MapReduce - Large Scale and Big Data: Processing and Management

processing are episodic and elastic, so the trend is to leverage resources from the
cloud when possible. A number of cloud data-processing platforms have emerged in
recent times to support such applications, many of them based on query processing
infrastructure similar to relational query engines. However, the relational model
and algebra have some limitations with respect to the requirements of Semantic Web
processing (large numbers of joins, irregular structure, and inferencing combined
with querying), and these limitations have an appreciable negative impact in the
MapReduce context.

This chapter reviews query evaluation techniques for graph pattern queries on
MapReduce platforms in terms of two query algebras: derivatives of relational
algebra and an alternative algebra called the Nested TripleGroup Data Model and
Algebra (NTGA). It discusses the advantage of NTGA over relational-style query
plans and data representation. That advantage comes from the concurrent execution
of “star joins,” which shortens the workflow and enables shared table scans while
keeping the footprint of intermediate results small. The chapter presents
evaluation results that show a performance advantage of up to 60% for relatively
basic queries involving 2 to 3 star patterns. This advantage is expected to be even
larger for more complex queries with more star patterns, because NTGA plans execute
the star joins concurrently.
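The idea of evaluating several star joins concurrently can be illustrated with a small in-memory sketch. It is not the NTGA operator set or a MapReduce job: the triples, the property names, and the helpers `group_by_subject` and `match_stars` are all hypothetical, chosen only to show how one grouping pass over the triples can answer more than one star pattern at once, where a relational-style plan would typically need a separate join for each pair of triple patterns.

```python
from collections import defaultdict

# Hypothetical toy data: (subject, property, object) triples.
TRIPLES = [
    ("gene1", "label", "BRCA1"),
    ("gene1", "xRef", "ref9"),
    ("gene2", "label", "TP53"),
    ("ref9", "source", "PubMed"),
    ("ref9", "year", "2011"),
]


def group_by_subject(triples):
    """Group triples that share a subject (the role a reduce phase would play)."""
    groups = defaultdict(list)
    for s, p, o in triples:
        groups[s].append((s, p, o))
    return groups


def match_stars(triples, star_patterns):
    """Evaluate several star patterns over a single grouping of the triples.

    star_patterns maps a pattern name to the set of properties that a
    subject must have in order to match that star.
    """
    matches = {name: {} for name in star_patterns}
    for subject, group in group_by_subject(triples).items():
        props = {p for _, p, _ in group}
        for name, required in star_patterns.items():
            if required <= props:  # subject has every required property
                # Keep only the triples relevant to this star.
                matches[name][subject] = [t for t in group if t[1] in required]
    return matches


# Two star patterns answered from the same scan of TRIPLES.
STARS = {"gene_star": {"label", "xRef"}, "ref_star": {"source", "year"}}
result = match_stars(TRIPLES, STARS)
```

In this sketch `gene2` is dropped because it lacks an `xRef` triple, and both stars are computed from the same grouped data, which loosely mirrors the shared scan and shorter workflow described above.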

Ongoing and future work on NTGA optimization focuses on adding the necessary
extensions (logical and physical operators and query rewriting rules) to enable
translation of more complex graph pattern queries to NTGA, such as graph patterns
with unbound properties, graph patterns with optional fragments, ontological
queries, and analytical queries. Preliminary results for some of these more complex
classes, specifically ontological queries, have shown performance advantages of up
to orders of magnitude and are thus very promising.
