Algebraic Optimization of RDF Graph Pattern Queries on MapReduce - Large Scale and Big Data: Processing and Management

Database Reference

In-Depth Information

Hive [44] provide high-level data flow or query languages and automatic optimiza-

tion techniques a la relational database systems, and are gaining popularity.

In the context of scalable processing of Semantic Web data, query processing

techniques have to address the unique challenges posed by semantic graph data mod-

els such as W3C's Resource Description Framework (RDF). In RDF, data is mod-

eled at a fine-grained level in the form of what are called statements or triples . A

triple ( Subject , Property , Object ) asserts that a resource (the Subject ) has a Property

whose value (the Object ) is another resource or a literal. Figure 6.1a shows a subset

of triples describing vendor-product offerings in the commerce context. The exam-

ple triple ( &Off 1, product , &Pr 1) states that an offer &Off 1 is for a product &Pr 1.

Further, nodes and edges may be linked to an ontological layer that defines types of

nodes and edges and relationships between node types and edge types. Therefore, an

RDF model may entail additional statements beyond what is explicitly asserted, and

entailed statements can be derived using inferencing techniques. An RDF database

is a collection of triples but can also be viewed as a directed labeled graph with nodes

representing subjects and objects , and labeled edges denoting property types.

The foundational construct for querying an RDF graph is a triple pattern , which

is a triple with a variable (denoted by leading '?') in any of the Subject, Property, or

Object positions, for example, (? o vendor ? v ). A triple t in the database is considered

a valid “match” for a triple pattern tp if some substitution of the variables in tp with

RDF terms (resources or literals), yields t . For example, triple t = (& Off 1, vendor ,

& V 1) is a match for triple pattern (? o vendor ? v ) because the substitutions ? o/ & Off 1

and ? v /& V 1 produce a valid triple t in the database. Multiple triple patterns can be

combined using combination operators to form Graph Patterns . A Basic Graph

Pattern is a conjunction of two or more triple patterns and shared variables across

triple patterns implicitly represent equi-joins. The example query Q1 in Figure 6.1b

is a graph pattern with 10 triple patterns with 9 implicit joins (3 with ? v , 3 with ? o ,

2 with ? r , 1 with ? prod ). The answer to a graph pattern is the list of substitutions for

each variable that agree across triple patterns (i.e., meet join conditions).

An important consequence of RDF's fine-grained data model is that query work-

loads contain much larger number of joins than relational workloads. In fact, up to

100 join operations have been reported in queries for some real applications [43].

This number can be further increased if a complete result, including entailed triples,

is desired and inferencing is done as query expansion during processing. The first

challenge in this context is that relational query plan generation strategies will be

confounded by the large search space generated by such numbers of joins. Further,

the heuristics that are used to prune search space such as restricting to only left-linear

plans, are not ideal for RDF queries. In fact, bushy plans are often preferred since

many of the joins are m-way star joins and can be evaluated as merge joins if suitable

sorted indexes exist. Consequently, such queries have a much larger query plan search

space than existing strategies are suited for. A second subtle but critical point is that

cost-based query optimization techniques used for ordering join operations assume

conditions that do not carry over to RDF models. For example, the containment of

value sets assumption that underlies join cardinality estimation exploits the referential

integrity constraint in relational models that allows this assumption to hold in many

relational join scenarios (e.g., join between foreign and primary key fields). However,

Large Scale and Big Data: Processing and Management

Search WWH ::

Custom Search

Home