5.4 PIGSPARQL EVALUATION
We evaluated our implementation on 10 Dell PowerEdge R200 servers connected via
a gigabit network. Each server was equipped with a dual-core 3.16-GHz processor,
4 GB of RAM, and a 1-TB hard disk, and had Hadoop 0.20.2 and Pig 0.5.0 installed.
Due to the replication of the distributed file system (HDFS), the actual available
payload was 2.5 TB.
We investigated the execution times, the amount of data read from HDFS (HDFS
Bytes Read), the amount of data written to HDFS (HDFS Bytes Written), and the
amount of data that was transferred during the shuffle phase (Reduce Shuffle Bytes).
We used SP²Bench [10], a SPARQL-specific performance benchmark that covers
a wide range of SPARQL features. The SP²Bench data generator was used to
produce RDF data sets of up to 1.6 billion triples based on the DBLP library [21]. In
the following, we present the evaluation of three representative and rather complex
SP²Bench queries that cover interesting aspects, such as many joins or an
OPTIONAL combined with a FILTER for unbound variables.
Q2. Extract All Inproceedings with the Given Properties and
Optional Abstract, Sorted by the Year of Publication
SELECT *
WHERE {
  ?inproc rdf:type bench:Inproceedings .
  ?inproc dc:creator ?author .
  ?inproc bench:booktitle ?booktitle .
  ?inproc dc:title ?title .
  ?inproc dcterms:partOf ?proc .
  ?inproc rdfs:seeAlso ?ee .
  ?inproc swrc:pages ?page .
  ?inproc foaf:homepage ?url .
  ?inproc dcterms:issued ?yr
  OPTIONAL { ?inproc bench:abstract ?abstract }
} ORDER BY ?yr
Q2. The left side of the OPTIONAL contains a BGP with nine triple patterns,
which requires eight joins without any optimization. In addition, the results must be
emitted in sorted order. Since all eight joins are on the same variable, ?inproc,
they can be implemented by a single multijoin (Q2 opt). As a result, the number of
MapReduce jobs necessary for executing Q2 is reduced from 12 to 5. The
query also benefits from vertical partitioning (Q2 opt+part), as all predicates are
bound, which leads to an overall reduction in query execution time of nearly 90% (a).
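The two optimizations discussed for Q2 can be illustrated with a small in-memory sketch. It is written in Python rather than Pig Latin and is not the PigSPARQL implementation. The function names vertical_partition and multijoin, the pattern encoding, and the toy triples are all assumptions made for illustration. The sketch first groups triples by predicate, so that a pattern with a bound predicate reads only its own partition, and then joins all patterns on the shared subject in one grouping pass instead of a chain of binary joins.

```python
from collections import defaultdict

# Toy triples (subject, predicate, object); made-up data, not SP2Bench output.
TRIPLES = [
    ("p1", "rdf:type", "bench:Inproceedings"),
    ("p1", "dc:creator", "alice"),
    ("p1", "dc:title", "Paper One"),
    ("p2", "rdf:type", "bench:Inproceedings"),
    ("p2", "dc:title", "Paper Two"),       # p2 has no dc:creator
    ("p3", "rdf:type", "bench:Article"),   # wrong type
    ("p3", "dc:creator", "bob"),
    ("p3", "dc:title", "Paper Three"),
]

# Hypothetical pattern encoding: (predicate, constant object or None, variable or None).
# Every pattern is assumed to share the subject variable ?inproc.
PATTERNS = [
    ("rdf:type", "bench:Inproceedings", None),
    ("dc:creator", None, "?author"),
    ("dc:title", None, "?title"),
]


def vertical_partition(triples):
    """Group triples by predicate, keeping only (subject, object) pairs."""
    parts = defaultdict(list)
    for s, p, o in triples:
        parts[p].append((s, o))
    return parts


def multijoin(patterns, parts):
    """Join all patterns on their shared subject in a single grouping pass."""
    # "Map" side: tag each matching object with the index of its pattern,
    # keyed by the join variable (the subject).
    by_subject = defaultdict(lambda: defaultdict(list))
    for i, (pred, const, _var) in enumerate(patterns):
        for s, o in parts.get(pred, []):
            if const is None or o == const:
                by_subject[s][i].append(o)

    # "Reduce" side: a subject survives only if every pattern matched it.
    results = []
    for s, matches in by_subject.items():
        if len(matches) != len(patterns):
            continue
        rows = [{"?inproc": s}]
        for i, (_pred, _const, var) in enumerate(patterns):
            if var is not None:
                rows = [{**row, var: o} for row in rows for o in matches[i]]
        results.extend(rows)
    return results


parts = vertical_partition(TRIPLES)
rows = multijoin(PATTERNS, parts)
print(rows)
```

With n patterns on the same variable, a chain of binary joins needs n - 1 join steps, whereas the grouping above handles them in one pass. This is the intuition behind the reduced number of MapReduce jobs reported for Q2 opt.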
Q6. This query implements a (closed-world) negation by combining an
OPTIONAL with a FILTER for unbound variables. None of the considered
algebra-level optimizations is applicable to this query. As a consequence, the computation
of the OPTIONAL produces many intermediate results. In fact, 75% of the aggregated
I/O values (diagram d of Figure 5.5) arise in a single MapReduce job (computation of