Relational Techniques for Storing and Querying RDF Data - Advanced Database Query Systems

Database Reference

In-Depth Information

HORIZONTAL STORES

Abadi et al. (2009) have presented SW-Store a new DBMS which is storing RDF data using a fully

decomposed storage model (DSM) (Copeland & Khoshafian, 1985). In this approach, the triples table

is rewritten into n two-column tables where n is the number of unique properties in the data. In each

of these tables, the first column contains the subjects that define that property and the second column

contains the object values for those subjects while the subjects that do not define a particular property

are simply omitted from the table for that property. Each table is sorted by subject, so that particular

subjects can be located quickly, and that fast merge joins can be used to reconstruct information about

multiple properties for subsets of subjects. For a multivalued attribute, each distinct value is listed in

a successive row in the table for that property. One advantage of this approach is that while property

tables need to be carefully constructed so that they are wide enough but not too wide to independently

answer queries, the algorithm for creating tables in the vertically partitioned approach is straightforward

and need not change over time. Moreover, in the property-class schema approach, queries that do not

restrict on class tend to have many union clauses while in the vertically partitioned approach, all data for

a particular property is located in the same table and thus union clauses in queries are less common. The

implementation of SW-Store relies on a column-oriented DBMS, C-store (Stonebraker et al., 2005), to

store tables as collections of columns rather than as collections of rows. In standard row-oriented data-

bases (e.g., Oracle, DB2, SQLServer, Postgres, etc.) entire tuples are stored consecutively. The problem

with this is that if only a few attributes are accessed per query, entire rows need to be read into memory

from disk before the projection can occur. By storing data in columns rather than rows projection occurs

for freeonly those columns relevant to a query need to be read.

Beckmann et al. (2006); Chu et al. (2007) have argued that storing a sparse data set (like RDF) in

multiple tables can cause problems. They suggested storing a sparse data set in a single table while the

complexities of sparse data management can be handled inside an RDBMS with the addition of an inter-

preted storage format. The proposed format starts with a header which contains fields such as relation-id,

tuple-id, and a tuple length. When a tuple has a value for an attribute, the attribute identifier, a length

field (if the type is of variable length), and the value appear in the tuple. The attribute identifier is the

id of the attribute in the system catalog while the attributes that appear in the system catalog but not in

the tuple are null for that tuple. Since the interpreted format stores nothing for null attributes, sparse

data sets in a horizontal schema can in general be stored much more compactly in the format. While

the interpreted format has storage benefits for sparse data, retrieving the values from attributes in tuples

is more complex. In fact, the format is called interpreted because the storage system must discover the

attributes and values of a tuple at tuple-access time, rather than using precompiled position information

from a catalog, as the positional format allows. To tackle this problem, a new operator (called EXTRACT

operator) is introduced to the query plans to precede any reference to attributes stored in the interpreted

format and returns the offsets to the referenced interpreted attribute values which is then used to re-

trieve the values. Value extraction from an interpreted record is a potentially expensive operation that

is dependent on the number attributes stored in a row, or the length of the tuple. Moreover, if a query

evaluation plan fetches each attribute individually and uses an EXTRACT call per attribute, the record

will be scanned for each attribute and will be very slow. Thus, a batch EXTRACT technique is used

to allow one scan of the present values and saves time. Table 1 provides a summary of the relational

techniques for processing RDF queries in terms of their representative approaches and their specific

query optimization techniques.

Advanced Database Query Systems

Search WWH ::

Custom Search

Home