is a distributed, scalable, and strictly consistent column-oriented NoSQL data store,
inspired by Google's Bigtable [22] and well integrated into Hadoop. Hadoop's
distributed file system, HDFS, is designed for sequential reads and writes of very large
files in a batch-processing manner but lacks the ability to access data randomly in
near real time. HBase can be seen as an additional storage layer on top of HDFS
that supports efficient random access. The data model of HBase corresponds to a
sparse multidimensional sorted map with the following access pattern:
(Table, RowKey, Family, Column, Timestamp) → Value
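To make the access pattern concrete, the following is a minimal in-memory sketch of such a sparse, multidimensional sorted map for a single table. It is a toy model only, not the HBase client API; the class name `ToyTable` and its methods are hypothetical names chosen for illustration.

```python
from collections import defaultdict


class ToyTable:
    """Toy model of one table: (RowKey, Family, Column, Timestamp) -> Value.

    Hypothetical illustration only; this is not the HBase client API.
    """

    def __init__(self):
        # row key -> family -> column -> {timestamp: value}
        # Only cells that were actually written exist, so the map is sparse.
        self._rows = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

    def put(self, row_key, family, column, timestamp, value):
        self._rows[row_key][family][column][timestamp] = value

    def get(self, row_key, family, column, timestamp=None):
        """Return the value at `timestamp`, or the newest version if None."""
        versions = self._rows.get(row_key, {}).get(family, {}).get(column, {})
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)  # newest version wins by default
        return versions.get(timestamp)

    def scan(self, start_row=None, stop_row=None):
        """Yield row keys in sorted order within [start_row, stop_row)."""
        for key in sorted(self._rows):
            if start_row is not None and key < start_row:
                continue
            if stop_row is not None and key >= stop_row:
                break
            yield key
```

Because rows are visited in key order, a range scan over contiguous row keys is a simple ordered traversal, which is the property the row-key index relies on.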
The rows of a table are sorted and indexed according to their row key and every
row can have an arbitrary number of columns. Columns are grouped into column
families, and column values (referred to as cells) are timestamped and thus support
multiple versions. HBase tables are dynamically split into regions of contiguous row
ranges with a configured maximum size. When a region becomes too large, it is
automatically split into two regions at the middle key (auto-sharding). However, HBase
has neither a declarative query language nor built-in support for join processing,
leaving higher-level data transformations to the overlying application layer. In
our approach we propose a map-side join strategy that leverages the implicit index
capabilities of HBase to overcome the usual restrictions of map-side joins as outlined
in Section 5.2.2.
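The auto-sharding of sorted row ranges described above can be sketched as follows. This is a simplified illustration under stated assumptions: region size is measured as a number of row keys (a real deployment is configured in terms of stored data size), and the function names `split_regions` and `find_region` are hypothetical.

```python
import bisect


def split_regions(sorted_keys, max_rows):
    """Split a sorted key list into contiguous regions of at most max_rows keys.

    A region that is too large is split at its middle key, and the two
    halves are checked again. Size is counted in rows (an assumption).
    """
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    if len(sorted_keys) <= max_rows:
        return [sorted_keys]
    mid = len(sorted_keys) // 2
    return (split_regions(sorted_keys[:mid], max_rows)
            + split_regions(sorted_keys[mid:], max_rows))


def find_region(regions, row_key):
    """Return the index of the region whose key range covers row_key."""
    start_keys = [region[0] for region in regions]
    # The covering region is the last one whose start key is <= row_key.
    idx = bisect.bisect_right(start_keys, row_key) - 1
    return max(idx, 0)
```

Since every region holds a contiguous, sorted key range, locating the region for a row key needs only a binary search over the region start keys.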
In [23], the authors adopted the idea of Hexastore [24], indexing all possible orderings
of an RDF triple, to store RDF data in HBase. This results in six tables in
HBase, which allows results for any possible SPARQL triple pattern to be retrieved with a
single lookup on one of the tables (except for a triple pattern with three variables).
However, as HDFS has a default replication factor of three and data in HBase is
stored in files on HDFS, an RDF data set is actually stored 18 times using this
schema. Storage space is not the only concern: loading a web-scale RDF data set
into HBase also becomes very costly and consumes many resources. Our storage schema
for RDF data in HBase is inspired by [25] and uses only two tables, T_s_po and T_o_ps.
We extend the schema with a triple pattern mapping that leverages the power of
predicate push-down filters in HBase to overcome possible performance shortcomings
of a two-table schema. Furthermore, we improve the scalability of the schema by
introducing a modified row key design for class assignments in RDF, which would
otherwise lead to overloaded regions constraining both scalability and performance.
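The storage arithmetic for the six-table approach can be checked with a short sketch. It enumerates the orderings of subject (s), predicate (p), and object (o), picks an ordering whose leading positions match the bound parts of a triple pattern, and multiplies the table count by the replication factor. The helper names `pick_index` and `stored_copies` are hypothetical and are not taken from [23] or [24].

```python
from itertools import permutations

# All orderings of subject (s), predicate (p), and object (o): one table each.
ORDERINGS = ["".join(p) for p in permutations("spo")]


def pick_index(bound):
    """Return an ordering whose prefix consists of exactly the bound positions.

    `bound` is a set such as {"p", "o"}. With nothing bound (three
    variables) no single lookup applies, so None is returned.
    """
    if not bound:
        return None
    for ordering in ORDERINGS:
        if set(ordering[:len(bound)]) == set(bound):
            return ordering
    return None


def stored_copies(replication_factor):
    """Number of physical copies of each triple across all index tables."""
    return len(ORDERINGS) * replication_factor
```

With six orderings and a replication factor of three, `stored_copies(3)` yields 18, matching the figure stated above.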
In a T_s_po table, an RDF triple is stored using the subject as row key, the predicate
as column name, and the object as column value. If a subject has more than one object
for a given predicate (e.g., an article having more than one author), these objects are
stored as different versions in the same column. The notation T_s_po indicates that the
table is indexed by subject. Table T_o_ps follows the same design, indexed by object. In
both tables there is only a single column family that contains all columns. Tables 5.2 and 5.3 illustrate
the corresponding tables for the RDF graph in Figure 5.6.
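As a rough illustration of the two-table layout, the sketch below builds a subject-indexed and an object-indexed table from a list of triples. It assumes that the object-indexed table mirrors the subject-indexed one (object as row key, predicate as column name, subject as column value), models multiple cell versions as a plain list, and uses made-up example triples; none of the names are taken from the HBase API or from Tables 5.2 and 5.3.

```python
from collections import defaultdict


def build_tables(triples):
    """Build toy subject-indexed and object-indexed tables from (s, p, o) triples."""
    # row key -> column name (predicate) -> list of values.
    # The list stands in for multiple versions of one cell.
    t_s_po = defaultdict(lambda: defaultdict(list))
    t_o_ps = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        t_s_po[s][p].append(o)  # subject is the row key, object the value
        t_o_ps[o][p].append(s)  # assumed mirror layout: object is the row key
    return t_s_po, t_o_ps


# Hypothetical example data: one article with two authors.
triples = [
    ("article1", "author", "alice"),
    ("article1", "author", "bob"),
    ("article1", "year", "2013"),
]
t_s_po, t_o_ps = build_tables(triples)
```

Here `t_s_po["article1"]["author"]` holds both authors in a single column, while `t_o_ps["alice"]["author"]` leads back to the article from the object side.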
At first glance, this storage schema seems to have performance drawbacks
compared with the six-table schema in [23], since there are only indexes for subjects
and objects. However, we can use the HBase Filter API to specify additional
column filters for table index lookups. These filters are applied directly on server