Large-Scale RDF Processing with MapReduce - Large Scale and Big Data: Processing and Management

5.4 PIGSPARQL EVALUATION

We evaluated our implementation on 10 Dell PowerEdge R200 servers connected via

a gigabit network. Each server was equipped with a Dual Core 3.16-GHz processor,

4-GB RAM, 1-TB hard disk, and Hadoop 0.20.2 as well as Pig 0.5.0 installed. Due

to the replication of the distributed file system (HDFS), the actual available payload

was 2.5 TB.

We investigated the execution times, the amount of data read from HDFS (HDFS

Bytes Read), the amount of data written to HDFS (HDFS Bytes Written), and the

amount of data that was transferred during the shuffle phase (Reduce Shuffle Bytes).

We used SP²Bench [10], a SPARQL-specific performance benchmark that covers

a wide range of SPARQL features. The SP²Bench data generator was used to

produce RDF data sets of up to 1.6 billion triples based on the DBLP library [21]. In

the following, we present the evaluation of three representative and rather complex

SP²Bench queries that cover interesting aspects, such as queries that involve many joins

or an OPTIONAL with a FILTER for unbound values.

Q2. Extract All Inproceedings with the Given Properties and

Optional Abstract, Sorted by the Year of Publication

SELECT *
WHERE {
  ?inproc rdf:type bench:Inproceedings.
  ?inproc dc:creator ?author.
  ?inproc bench:booktitle ?booktitle.
  ?inproc dc:title ?title.
  ?inproc dcterms:partOf ?proc.
  ?inproc rdfs:seeAlso ?ee.
  ?inproc swrc:pages ?page.
  ?inproc foaf:homepage ?url.
  ?inproc dcterms:issued ?yr
  OPTIONAL { ?inproc bench:abstract ?abstract }
} ORDER BY ?yr

Q2. The left side of the OPTIONAL contains a BGP with nine triple patterns

that requires (without any optimization) eight joins. In addition, the results should be

emitted in sorted order. Since all eight joins apply to the same variable ?inproc,

they can be implemented by a single multijoin (Q2 opt). As a result, the number of

MapReduce jobs that are necessary for executing Q2 is reduced from 12 to 5. The

query also benefits from the vertical partitioning (Q2 opt+part) as all predicates are

bound, which leads to an overall query execution time reduction of nearly 90% (a).
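
The effect of the multijoin and of vertical partitioning on Q2 can be illustrated with a small in-memory sketch in Python. This is not PigSPARQL's implementation or the Pig Latin it generates. The function names (vertical_partition, multijoin_on_subject), the toy triples, and the reduced three-pattern version of the query are assumptions made purely for illustration. The idea shown is that all patterns sharing the subject variable are grouped by subject once, instead of being joined pairwise, and that only the partitions of the predicates named in the query are read.

```python
from collections import defaultdict
from itertools import product


def vertical_partition(triples):
    """Split (subject, predicate, object) triples into one (s, o) list per predicate."""
    parts = defaultdict(list)
    for s, p, o in triples:
        parts[p].append((s, o))
    return parts


def multijoin_on_subject(parts, subject_var, patterns):
    """Join all triple patterns that share one subject variable in a single pass.

    patterns is a list of (predicate, object) pairs; an object starting with
    '?' is a variable, anything else is a constant that must match exactly.
    """
    # Map-like step: key every matching value by its subject and remember
    # which pattern it came from. Only the partitions named in the patterns
    # are read, which is where bound predicates profit from partitioning.
    groups = defaultdict(lambda: [[] for _ in patterns])
    for i, (pred, obj) in enumerate(patterns):
        for s, o in parts.get(pred, []):
            if obj.startswith("?") or obj == o:
                groups[s][i].append(o)

    # Reduce-like step: a subject yields results only if every pattern matched.
    rows = []
    for s, buckets in groups.items():
        if not all(buckets):
            continue
        for combo in product(*buckets):
            row = {subject_var: s}
            for (_, obj), value in zip(patterns, combo):
                if obj.startswith("?"):
                    row[obj] = value
            rows.append(row)
    return rows


# Hypothetical toy data; the real evaluation used SP2Bench-generated triples.
triples = [
    ("p1", "rdf:type", "bench:Inproceedings"),
    ("p1", "dc:creator", "alice"),
    ("p1", "dc:creator", "bob"),
    ("p1", "dcterms:issued", "2001"),
    ("p2", "rdf:type", "bench:Inproceedings"),
    ("p2", "dc:creator", "carol"),          # no dcterms:issued, so no result
    ("p3", "rdf:type", "bench:Article"),    # wrong type, so no result
    ("p3", "dc:creator", "dave"),
    ("p3", "dcterms:issued", "1999"),
    ("p4", "rdf:type", "bench:Inproceedings"),
    ("p4", "dc:creator", "erin"),
    ("p4", "dcterms:issued", "1998"),
]

# Three of the nine Q2 patterns, all sharing the subject variable ?inproc.
patterns = [
    ("rdf:type", "bench:Inproceedings"),
    ("dc:creator", "?author"),
    ("dcterms:issued", "?yr"),
]

parts = vertical_partition(triples)
rows = sorted(multijoin_on_subject(parts, "?inproc", patterns),
              key=lambda r: r["?yr"])   # corresponds to ORDER BY ?yr
for row in rows:
    print(row)
```

A chain of binary joins would need one grouping pass per join, whereas the sketch groups by ?inproc once regardless of how many patterns share that variable. This mirrors the reduction in MapReduce jobs described for Q2 opt, although the sketch does not model the job counts themselves.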

Q6. This query implements a (closed world) negation by the combination of

OPTIONAL and a FILTER for unbound values. None of the considered optimizations

on the algebra level is possible for this query. As a consequence, the computation

of the OPTIONAL produces many intermediate results. In fact, 75% of the aggregated

I/O values (diagram d of Figure 5.5) arise in a single MapReduce job (computation of
