Relational Techniques for Storing and Querying RDF Data - Advanced Database Query Systems

Database Reference

In-Depth Information

Therefore, several approaches have been proposed to deal with this limitation by using extensive set of

indexes or by using selectivity estimation information to optimize the join ordering.

Harris & Gibbins (2003) have described the 3store RDF storage system. The storage system of 3Store

is based on a central triple table which holds the hashes for the subject, predicate, object and graph

identifier. The graph identifier equal to zero if the triple resides in the anonymous background graph. A

symbols table is used to allow reverse lookups from the hash to the hashed value, for example, to return

results. Furthermore it allows SQL operations to be performed on pre-computed values in the data types

of the columns without the use of casts. For evaluating SPARQL queries, the triples table is joined once

for each triple in the graph pattern where variables are bound to their values when they encounter the

slot in which the variable appears. Subsequent occurrences of variables in the graph pattern are used to

constrain any appropriate joins with their initial binding. To produce the intermediate results table, the

hashes of any SPARQL variables required to be returned in the results set are projected and the hashes

from the intermediate results table are joined to the symbols table to provide the textual representation

of the results.

Neumann & Weikum (2008) have presented the RDF-3X (RDF Triple eXpress) RDF query engine

which tries to overcome the criticism that triples stores incurs too many expensive self-joins by creat-

ing the exhaustive set of indexes and relying on fast processing of merge joins. The physical design of

RDF-3x is workload-independent and eliminates the need for physical-design tuning by building indexes

over all 6 permutations of the three dimensions that constitute an RDF triple. Additionally, indexes over

count-aggregated variants for all three two-dimensional and all three one-dimensional projections. The

query processor follows RISC-style design philosophy (Chaudhuri & Weikum, 2000) by using the full

set of indexes on the triple tables to rely mostly on merge joins over sorted index lists. The query op-

timizer relies upon its cost model in finding the lowest-cost execution plan and mostly focuses on join

order and the generation of execution plans. In principle, selectivity estimation has a huge impact on

plan generation. While this is a standard problem in database systems, the schema-free nature of RDF

data makes the problem more challenging. RDF-3X employs dynamic programming for plan enumera-

tion, with a cost model based on RDF-specific statistical synopses. It relies on two kinds of statistics:

1) specialized histograms which are generic and can handle any kind of triple patterns and joins. The

disadvantage of histograms is that it assumes independence between predicates. 2) frequent join paths

in the data which give more accurate estimation. During query optimization, the query optimizer uses

the join-path selectivity information when available and otherwise assume independence and use the

histograms information. Neumann & Weikum (2009) have extended the work further by introducing a

runtime technique for accelerating query executions. It uses a light-weight, RDF-specific technique for

sideways information passing across different joins and index scans within the query execution plans.

They have also enhanced the selectivity estimator of the query optimizer by using very fast index lookups

on specifically designed aggregation indexes, rather than relying on the usual kinds of coarse-grained

histograms. This provides much more accurate estimates at compile-time, at a fairly small cost that is

easily amortized by providing better directives for the join-order optimization.

Weiss et al. (2008) have presented the Hexastore RDF storage scheme with main focuses on scalability

and generality in its data storage, processing and representation. Hexastore is based on the idea of indexing

the RDF data in a multiple indexing scheme (Harth & Decker, 2005). It does not discriminate against any

RDF element and treats subjects, properties and objects equally. Each RDF element type have its special

index structures built around it. Moreover, every possible ordering of the importance or precedence of

the three elements in an indexing scheme is materialized. Each index structure in a Hexastore centers

Advanced Database Query Systems

Search WWH ::

Custom Search

Home