Large-Scale RDF Processing with MapReduce - Large Scale and Big Data: Processing and Management


5.7 MAPSIN EVALUATION

The evaluation was performed on the same cluster as in Section 5.4, but we increased the RAM of every server to 8 GB because HBase is memory-intensive. We used HBase version 0.90.4.

We used the well-known Lehigh University Benchmark (LUBM) [11], as its queries can easily be formulated as SPARQL basic graph patterns (BGPs). The generated data sets ranged from 1000 up to 3000 universities, and we used the WebPIE inference engine for Hadoop [26] to precompute the transitive closure. The loading times of both tables T_s_po and T_o_ps for all data sets are listed in Table 5.5. Figure 5.8 compares the performance of PigSPARQL and MAPSIN for selected LUBM queries that represent the different query types. Our proof-of-concept implementation is currently limited to a maximum of two join variables, as the goal was to demonstrate the feasibility of the approach for selective queries rather than to support all possible BGP constellations. For a detailed comparison, the runtimes of all executed queries are listed in Table 5.6.

LUBM queries Q1, Q3, Q5, Q11, and Q13 demonstrate the base case with a single join between two triple patterns (cf. Figure 5.8a). MAPSIN joins performed 8 to 13 times faster than the reduce-side joins of PigSPARQL. Furthermore, the performance gain increases with the size of the data set.
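To make the base case concrete, the following toy sketch joins two triple patterns by probing an index once per binding of the first pattern, rather than shuffling both inputs to a reducer. It is not the authors' implementation: a plain Python dict stands in for the subject-keyed HBase table, and the data, the predicate names, and the helper `index_join` are hypothetical.

```python
from collections import defaultdict

# Toy triples (hypothetical data, not taken from LUBM).
triples = [
    ("alice", "worksFor", "dept0"),
    ("alice", "name", "Alice"),
    ("bob", "worksFor", "dept1"),
    ("bob", "name", "Bob"),
    ("carol", "name", "Carol"),
]

# In-memory stand-in for a subject-keyed table: subject -> [(predicate, object)].
s_po = defaultdict(list)
for s, p, o in triples:
    s_po[s].append((p, o))


def index_join(left_bindings, predicate):
    """For each subject bound by the first pattern, look up its
    (predicate, object) pairs in the index and emit joined bindings."""
    for subject, left_value in left_bindings:
        for p, o in s_po.get(subject, []):
            if p == predicate:
                yield (subject, left_value, o)


# First pattern: ?x worksFor ?d (evaluated here by a simple scan).
left = [(s, o) for s, p, o in triples if p == "worksFor"]

# Second pattern: ?x name ?n, joined on ?x via index lookups.
result = sorted(index_join(left, "name"))
print(result)
```

Only subjects that match the first pattern trigger a lookup, which is why this style of join tends to favor selective queries.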

LUBM queries Q4 (5 triple patterns), Q7 (4 triple patterns), and Q8 (5 triple patterns) demonstrate the more general case with a sequence of cascaded joins (cf. Figure 5.8b). In these cases, MAPSIN joins perform up to 28 times faster than PigSPARQL. Query Q4 is of particular interest because all of its triple patterns share the same join variable, which makes it eligible for the multiway join optimization outlined in Section 5.6.3. PigSPARQL supports this kind of optimization as well, so both approaches can compute the query result with a single multiway join (cf. Figure 5.8c). The MAPSIN multiway join optimization improves the basic MAPSIN join execution time by a factor of 3.3 (LUBM Q4), independently of the data size. Moreover, the MAPSIN multiway join optimization performs 19 to 28 times faster than the reduce-side multiway join implementation of PigSPARQL.
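The eligibility condition mentioned above (all triple patterns share the same join variable) can be checked mechanically. The sketch below is a minimal illustration, assuming a non-empty BGP given as a list of (subject, predicate, object) string tuples with variables prefixed by `?`. The helper names are hypothetical, and the Q4 patterns follow the commonly published form of the LUBM query, so the exact IRIs should be verified against the benchmark definition.

```python
def variables(pattern):
    """Return the set of variables (terms starting with '?') in a triple pattern."""
    return {term for term in pattern if term.startswith("?")}


def shared_join_variables(bgp):
    """Return the variables that occur in every triple pattern of a non-empty BGP."""
    common = variables(bgp[0])
    for pattern in bgp[1:]:
        common &= variables(pattern)
    return common


# LUBM Q4 as commonly published (prefixes abbreviated; IRIs are an assumption).
q4 = [
    ("?x", "rdf:type", "ub:Professor"),
    ("?x", "ub:worksFor", "<http://www.Department0.University0.edu>"),
    ("?x", "ub:name", "?y1"),
    ("?x", "ub:emailAddress", "?y2"),
    ("?x", "ub:telephone", "?y3"),
]

# A hypothetical chain-shaped BGP with no variable common to all patterns.
chain = [
    ("?a", "ub:advisor", "?b"),
    ("?b", "ub:worksFor", "?c"),
    ("?c", "ub:subOrganizationOf", "?d"),
]

print(shared_join_variables(q4))
print(shared_join_variables(chain))
```

A non-empty result means the whole BGP could, in principle, be answered with a single multiway join on that variable; an empty result means a sequence of cascaded joins is needed.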

The remaining queries (LUBM Q6, Q14) consist of a single triple pattern. Consequently, they do not contain a join processing step and primarily illustrate the advantages of the distributed HBase table scan compared with the HDFS storage

TABLE 5.5
LUBM Loading Times for Tables T_s_po and T_o_ps (hh:mm:ss)

LUBM            1000          1500          2000          2500          3000
# RDF triples   ~210 million  ~315 million  ~420 million  ~525 million  ~630 million
T_s_po          00:28:50      00:42:10      00:52:03      00:56:00      01:05:25
T_o_ps          00:48:57      01:14:59      01:21:53      01:38:52      01:34:22
Total           01:17:47      01:57:09      02:13:56      02:34:52      02:39:47
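The totals in Table 5.5 are the sums of the two per-table loading times. The following sketch recomputes them and derives a rough load throughput for each data set. The times and triple counts are copied from the table; the throughput is only an approximation because the triple counts are rounded, and the helper names are hypothetical.

```python
def to_seconds(hms):
    """Convert an 'hh:mm:ss' string to a number of seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def to_hms(seconds):
    """Convert a number of seconds to an 'hh:mm:ss' string."""
    h, rest = divmod(seconds, 3600)
    m, s = divmod(rest, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"


# Values copied from Table 5.5: (T_s_po, T_o_ps, Total) per LUBM size.
loading = {
    1000: ("00:28:50", "00:48:57", "01:17:47"),
    1500: ("00:42:10", "01:14:59", "01:57:09"),
    2000: ("00:52:03", "01:21:53", "02:13:56"),
    2500: ("00:56:00", "01:38:52", "02:34:52"),
    3000: ("01:05:25", "01:34:22", "02:39:47"),
}

# Approximate number of RDF triples per data set, in millions (from Table 5.5).
triples_millions = {1000: 210, 1500: 315, 2000: 420, 2500: 525, 3000: 630}

for size, (s_po, o_ps, total) in loading.items():
    computed = to_hms(to_seconds(s_po) + to_seconds(o_ps))
    # Approximate triples loaded per second into both tables together.
    rate = triples_millions[size] * 1_000_000 / to_seconds(total)
    print(f"LUBM {size}: total {computed} (table: {total}), ~{rate:,.0f} triples/s")
```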
