Large-Scale RDF Processing with MapReduce - Large Scale and Big Data: Processing and Management


HBase is a distributed, scalable, and strictly consistent column-oriented NoSQL data store, inspired by Google's Bigtable [22] and well integrated into Hadoop. Hadoop's distributed file system, HDFS, is designed for sequential reads and writes of very large files in a batch-processing manner but lacks the ability to access data randomly in close to real time. HBase can be seen as an additional storage layer on top of HDFS that supports efficient random access. The data model of HBase corresponds to a sparse multidimensional sorted map with the following access pattern:

( Table, RowKey, Family, Column, Timestamp ) → Value
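The access pattern above can be pictured with a small in-memory model. The following is only a sketch in plain Python, not HBase code: the class `SparseSortedTable` and its methods are hypothetical names chosen for illustration, and one instance stands for a single table.

```python
from collections import defaultdict


class SparseSortedTable:
    """Toy model of one table: row key -> (family, column) -> {timestamp: value}."""

    def __init__(self):
        # Rows and cells are created only when written, so the map stays sparse.
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, timestamp, value):
        self._rows[row_key][(family, column)][timestamp] = value

    def get(self, row_key, family, column, timestamp=None):
        """Return the value at `timestamp`, or the newest version if none is given."""
        versions = self._rows.get(row_key, {}).get((family, column), {})
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)
        return versions.get(timestamp)

    def scan(self):
        """Yield (row_key, family, column, timestamp, value) in row-key order."""
        for row_key in sorted(self._rows):
            for (family, column), versions in sorted(self._rows[row_key].items()):
                for ts in sorted(versions, reverse=True):
                    yield row_key, family, column, ts, versions[ts]


# Made-up example data: two rows, one cell with two versions.
table = SparseSortedTable()
table.put("row2", "cf", "name", 1, "b")
table.put("row1", "cf", "name", 1, "a-old")
table.put("row1", "cf", "name", 2, "a-new")
```

Reading without a timestamp returns the newest version of a cell, and scanning visits rows in sorted row-key order regardless of insertion order.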

The rows of a table are sorted and indexed according to their row key, and every row can have an arbitrary number of columns. Columns are grouped into column families, and column values (denoted as cells) are timestamped and thus support multiple versions. HBase tables are dynamically split into regions of contiguous row ranges with a configured maximum size. When a region becomes too large, it is automatically split into two regions at the middle key (auto-sharding). However, HBase has neither a declarative query language nor built-in support for native join processing, leaving higher-level data transformations to the overlying application layer. In our approach, we propose a map-side join strategy that leverages the implicit index capabilities of HBase to overcome the usual restrictions of map-side joins as outlined in Section 5.2.2.
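The auto-sharding behavior described above can be sketched as follows. This is a simplified model, not HBase's implementation: a region is a sorted Python list of row keys, the maximum size is assumed to be a row count (`max_rows`) rather than a size in bytes, and the function name `insert_row` is hypothetical.

```python
import bisect


def insert_row(regions, row_key, max_rows):
    """Insert row_key into the region covering it and split that region if needed.

    `regions` is a list of sorted lists; together they cover contiguous,
    non-overlapping row-key ranges in ascending order.
    """
    # Start keys of all regions except the first, which also catches smaller keys.
    start_keys = [region[0] for region in regions[1:]]
    idx = bisect.bisect_right(start_keys, row_key)
    region = regions[idx]
    if row_key not in region:
        bisect.insort(region, row_key)
    # Split at the middle key once the region exceeds the assumed maximum.
    if len(region) > max_rows:
        mid = len(region) // 2
        regions[idx:idx + 1] = [region[:mid], region[mid:]]
    return regions


regions = [[]]  # a new table starts with a single empty region
for key in ["e", "a", "c", "b", "d"]:
    insert_row(regions, key, max_rows=3)
```

With `max_rows=3`, the fourth insertion triggers a split at the middle key, and later keys are routed to whichever region covers their range.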

In [23], the authors adopted the idea of Hexastore [24] to index all possible orderings of an RDF triple for storing RDF data in HBase. This results in six tables in HBase, which makes it possible to retrieve results for any SPARQL triple pattern with a single lookup on one of the tables (except for a triple pattern with three variables). However, as HDFS has a default replication factor of three and data in HBase is stored in files on HDFS, an RDF data set is actually stored 18 times using this schema. Storage space is not the only concern: loading a web-scale RDF data set into HBase also becomes very costly and consumes many resources. Our storage schema for RDF data in HBase is inspired by [25] and uses only two tables, T_s_po and T_o_ps. We extend the schema with a triple pattern mapping that leverages the power of predicate push-down filters in HBase to overcome possible performance shortcomings of a two-table schema. Furthermore, we improve the scalability of the schema by introducing a modified row key design for class assignments in RDF, which would otherwise lead to overloaded regions constraining both scalability and performance.
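The storage overhead mentioned above follows from simple arithmetic, which the short sketch below spells out. The helper `stored_copies` is a hypothetical name, and the figure for the two-table schema is derived here from the same multiplication rather than quoted from the text.

```python
from itertools import permutations

HDFS_REPLICATION = 3  # default replication factor stated in the text

# All orderings of subject (s), predicate (p), and object (o): one table each.
six_table_orderings = ["".join(p) for p in permutations("spo")]
two_table_orderings = ["s_po", "o_ps"]


def stored_copies(num_tables, replication=HDFS_REPLICATION):
    """Number of physical copies of each triple: tables times HDFS replicas."""
    return num_tables * replication


six_table_copies = stored_copies(len(six_table_orderings))
two_table_copies = stored_copies(len(two_table_orderings))
```

Under these assumptions, the six-table schema yields the 18 copies stated above, while two tables yield 6.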

In a T_s_po table, an RDF triple is stored using the subject as row key, the predicate as column name, and the object as column value. If a subject has more than one object for a given predicate (e.g., an article having more than one author), these objects are stored as different versions in the same column. The notation T_s_po indicates that the table is indexed by subject. Table T_o_ps follows the same design. In both tables, there is only a single column family that contains all columns. Tables 5.2 and 5.3 illustrate the corresponding tables for the RDF graph in Figure 5.6.
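As an illustration of this layout, the sketch below builds both tables from a handful of made-up triples (not the graph in Figure 5.6). It is plain Python rather than HBase code: `build_tables` and `lookup` are hypothetical helpers, multiple values per column are kept in a list to stand in for cell versions, and the T_o_ps layout (object as row key, predicate as column name, subject as value) is our reading of "follows the same design".

```python
from collections import defaultdict


def build_tables(triples):
    """Return (t_s_po, t_o_ps) as row key -> column name -> list of values."""
    t_s_po = defaultdict(lambda: defaultdict(list))
    t_o_ps = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        t_s_po[s][p].append(o)  # subject row, predicate column, object value
        t_o_ps[o][p].append(s)  # object row, predicate column, subject value
    return t_s_po, t_o_ps


def lookup(table, row_key, column=None):
    """Return (column, value) pairs of one row, optionally for a single column."""
    row = table.get(row_key, {})
    return [
        (col, val)
        for col, vals in sorted(row.items())
        if column is None or col == column
        for val in vals
    ]


# Made-up example triples.
triples = [
    ("Article1", "author", "Alice"),
    ("Article1", "author", "Bob"),
    ("Article1", "title", "Title1"),
    ("Article2", "author", "Alice"),
]
t_s_po, t_o_ps = build_tables(triples)
```

Here `lookup(t_s_po, "Article1", "author")` returns both authors from a single row, and `lookup(t_o_ps, "Alice")` finds every subject that has `Alice` as an object. The optional `column` argument loosely mirrors restricting a row lookup to one predicate.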

At first glance, this storage schema seems to have performance drawbacks when compared with the six-table schema in [23], since there are only indexes for subjects and objects. However, we can use the HBase Filter API to specify additional column filters for table index lookups. These filters are applied directly on server
