Large-Scale RDF Processing with MapReduce - Large Scale and Big Data: Processing and Management

TABLE 5.4
SPARQL Triple Pattern Mapping Using HBase Predicate Push-Down Filters

Pattern         Table                            Filter
(s, p, o)       T_s_po or T_o_ps                 Column and value
(?s, p, o)      T_o_ps                           Column
(s, ?p, o)      T_s_po or T_o_ps                 Value
(s, p, ?o)      T_s_po                           Column
(?s, ?p, o)     T_o_ps                           -
(?s, p, ?o)     T_s_po or T_o_ps (table scan)    Column
(s, ?p, ?o)     T_s_po                           -
(?s, ?p, ?o)    T_s_po or T_o_ps (table scan)    -

has similar performance characteristics compared to the six-table schema but uses only one third of the storage space.
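The pattern-to-table mapping of Table 5.4 can be read as a simple lookup keyed by which positions of a triple pattern are bound. The following is a minimal sketch of that reading, not the authors' implementation: the names `MAPPING` and `plan_lookup` are hypothetical, and the sketch only reports which table and filter the table lists without touching HBase.

```python
# Hypothetical encoding of Table 5.4.
# Key: (subject bound?, predicate bound?, object bound?)
# Value: (candidate tables, push-down filter or None, full table scan?)
MAPPING = {
    (True,  True,  True):  (("T_s_po", "T_o_ps"), "column and value", False),
    (False, True,  True):  (("T_o_ps",),          "column",           False),
    (True,  False, True):  (("T_s_po", "T_o_ps"), "value",            False),
    (True,  True,  False): (("T_s_po",),          "column",           False),
    (False, False, True):  (("T_o_ps",),          None,               False),
    (False, True,  False): (("T_s_po", "T_o_ps"), "column",           True),
    (True,  False, False): (("T_s_po",),          None,               False),
    (False, False, False): (("T_s_po", "T_o_ps"), None,               True),
}


def plan_lookup(s, p, o):
    """Return (tables, filter, full_scan) for a triple pattern.

    Terms starting with '?' are treated as variables; everything
    else is treated as a bound value.
    """
    key = tuple(not term.startswith("?") for term in (s, p, o))
    return MAPPING[key]
```

For example, `plan_lookup("?s", "type", "Person")` selects T_o_ps with a column filter, because the bound object serves as row key and the bound predicate restricts the columns.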

Our experiments also revealed some fundamental scaling limitations of the storage schema caused by T_o_ps. In general, an RDF data set uses a relatively small number of classes but contains many triples that link resources to classes, for example, (Alex, type, Person). Thus, using the object of a triple as row key means that all resources of the same class will be stored in the same row. With increasing data set size, these rows become very large and exceed the configured maximum region size, resulting in overloaded regions that contain only a single row. Since HBase cannot split these regions, the resources of a single machine become a bottleneck for scalability. To circumvent this problem, we use a modified T_o_ps row key design for triples with predicate type. Instead of using the object as row key, we use a compound row key of object and subject, for example, (Person|Alex). As a result, we cannot access all resources of a class with a single table lookup, but as the corresponding rows will be consecutive in T_o_ps, we can use an efficient range scan starting at the first entry of the class.
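The effect of the compound row key can be illustrated with a small in-memory simulation. This is a sketch under stated assumptions rather than HBase code: a sorted Python list stands in for the lexicographically ordered row keys of T_o_ps, and the names `row_key`, `scan_class`, and `SEP` are hypothetical.

```python
import bisect

SEP = "|"  # separator between object and subject, as in (Person|Alex)


def row_key(subject, predicate, obj):
    """Build the T_o_ps row key for a triple.

    Triples with predicate 'type' get a compound object|subject key so
    that one class no longer collapses into a single huge row.
    """
    if predicate == "type":
        return f"{obj}{SEP}{subject}"
    return obj


def scan_class(sorted_keys, cls):
    """Range scan: collect all subjects whose key starts with 'cls|'.

    sorted_keys must be in lexicographic order, which is how the
    simulated table keeps its rows.
    """
    prefix = cls + SEP
    start = bisect.bisect_left(sorted_keys, prefix)  # first entry of the class
    subjects = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # rows of one class are consecutive, so we can stop here
        subjects.append(key[len(prefix):])
    return subjects


triples = [
    ("Alex", "type", "Person"),
    ("Bob", "type", "Person"),
    ("Acme", "type", "Organization"),
    ("Carol", "type", "Professor"),
]
keys = sorted(row_key(s, p, o) for s, p, o in triples)
persons = scan_class(keys, "Person")
```

The scan touches only the consecutive `Person|...` rows and stops at the first key with a different prefix, which is why the single-row lookup can be traded for a range scan at little cost.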

5.6 MAPSIN JOIN

The indexing capabilities of HBase lay the foundation for our Map-Side Index Nested Loop Join (MAPSIN), which improves the query performance of selective queries. This allows us to retain the flexibility of reduce-side joins while utilizing the effectiveness of a map-side join without any changes to the underlying frameworks. We start the discussion by introducing the base case of our join technique, followed by our strategy for cascading a sequence of joins. Finally, we propose optimizations for multiway joins and one-pattern queries.
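As background, a generic index nested loop join can be sketched in a few lines. This is only an illustration of the general technique the name MAPSIN refers to, not the authors' algorithm: a Python dictionary stands in for the HBase index, the loop runs locally instead of inside a map phase, and all function names are hypothetical. The sketch also assumes that the subject of the second pattern is bound once the left-hand mapping has been substituted.

```python
from collections import defaultdict


def is_var(term):
    return term.startswith("?")


def build_index(triples):
    """Index triples by subject (a stand-in for a subject-keyed table)."""
    index = defaultdict(list)
    for s, p, o in triples:
        index[s].append((p, o))
    return index


def match(pattern, triple, mapping):
    """Extend mapping so that pattern matches triple, or return None."""
    extended = dict(mapping)
    for term, value in zip(pattern, triple):
        if is_var(term):
            # A variable must agree with any value it is already bound to.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def index_nested_loop_join(left_mappings, pattern, index):
    """For every left mapping, probe the index instead of scanning all triples."""
    results = []
    for mapping in left_mappings:
        s_term = pattern[0]
        subject = mapping.get(s_term, s_term)  # substitute a bound variable
        if is_var(subject):
            raise ValueError("this sketch needs a bound subject to probe the index")
        for p, o in index.get(subject, []):
            merged = match(pattern, (subject, p, o), mapping)
            if merged is not None:
                results.append(merged)
    return results


triples = [
    ("Alex", "type", "Person"),
    ("Alex", "knows", "Bob"),
    ("Bob", "type", "Person"),
    ("Bob", "knows", "Carol"),
]
index = build_index(triples)
left = [{"?x": "Alex"}, {"?x": "Bob"}]  # mappings for (?x, type, Person)
joined = index_nested_loop_join(left, ("?x", "knows", "?y"), index)
```

Each left-hand mapping triggers one index probe, so the cost grows with the number of left-hand mappings rather than with the size of the data set, which is the property that favors selective queries.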

5.6.1 Base Case

To compute the join between two triple patterns, p1 ⋈ p2, we have to merge the compatible mappings for p1 and p2. Therefore, it is necessary that subsets of both
