Database Reference
In-Depth Information
HORIZONTAL STORES
Abadi et al. (2009) have presented SW-Store a new DBMS which is storing RDF data using a fully
decomposed storage model (DSM) (Copeland & Khoshafian, 1985). In this approach, the triples table
is rewritten into n two-column tables where n is the number of unique properties in the data. In each
of these tables, the first column contains the subjects that define that property and the second column
contains the object values for those subjects while the subjects that do not define a particular property
are simply omitted from the table for that property. Each table is sorted by subject, so that particular
subjects can be located quickly, and that fast merge joins can be used to reconstruct information about
multiple properties for subsets of subjects. For a multivalued attribute, each distinct value is listed in
a successive row in the table for that property. One advantage of this approach is that while property
tables need to be carefully constructed so that they are wide enough but not too wide to independently
answer queries, the algorithm for creating tables in the vertically partitioned approach is straightforward
and need not change over time. Moreover, in the property-class schema approach, queries that do not
restrict on class tend to have many union clauses while in the vertically partitioned approach, all data for
a particular property is located in the same table and thus union clauses in queries are less common. The
implementation of SW-Store relies on a column-oriented DBMS, C-store (Stonebraker et al., 2005), to
store tables as collections of columns rather than as collections of rows. In standard row-oriented data-
bases (e.g., Oracle, DB2, SQLServer, Postgres, etc.) entire tuples are stored consecutively. The problem
with this is that if only a few attributes are accessed per query, entire rows need to be read into memory
from disk before the projection can occur. By storing data in columns rather than rows projection occurs
for freeonly those columns relevant to a query need to be read.
Beckmann et al. (2006); Chu et al. (2007) have argued that storing a sparse data set (like RDF) in
multiple tables can cause problems. They suggested storing a sparse data set in a single table while the
complexities of sparse data management can be handled inside an RDBMS with the addition of an inter-
preted storage format. The proposed format starts with a header which contains fields such as relation-id,
tuple-id, and a tuple length. When a tuple has a value for an attribute, the attribute identifier, a length
field (if the type is of variable length), and the value appear in the tuple. The attribute identifier is the
id of the attribute in the system catalog while the attributes that appear in the system catalog but not in
the tuple are null for that tuple. Since the interpreted format stores nothing for null attributes, sparse
data sets in a horizontal schema can in general be stored much more compactly in the format. While
the interpreted format has storage benefits for sparse data, retrieving the values from attributes in tuples
is more complex. In fact, the format is called interpreted because the storage system must discover the
attributes and values of a tuple at tuple-access time, rather than using precompiled position information
from a catalog, as the positional format allows. To tackle this problem, a new operator (called EXTRACT
operator) is introduced to the query plans to precede any reference to attributes stored in the interpreted
format and returns the offsets to the referenced interpreted attribute values which is then used to re-
trieve the values. Value extraction from an interpreted record is a potentially expensive operation that
is dependent on the number attributes stored in a row, or the length of the tuple. Moreover, if a query
evaluation plan fetches each attribute individually and uses an EXTRACT call per attribute, the record
will be scanned for each attribute and will be very slow. Thus, a batch EXTRACT technique is used
to allow one scan of the present values and saves time. Table 1 provides a summary of the relational
techniques for processing RDF queries in terms of their representative approaches and their specific
query optimization techniques.
Search WWH ::




Custom Search