?author foaf:name ?name
OPTIONAL {
?class2 rdfs:subClassOf foaf:Document.
?doc2 rdf:type ?class2.
?doc2 dcterms:issued ?yr2.
?doc2 dc:creator ?author2
FILTER (?author=?author2 && ?yr2 < ?yr)
} FILTER (!bound(?author2))
}
Note that diagram (d) of Figure 5.5 uses 800 million RDF triples so that executions
with and without vertical partitioning can be compared, and 1600 million triples
for comparison with the other queries.
Q3. Select All Articles with Property (a) swrc:pages (b) swrc:month
SELECT ?article
WHERE {
?article rdf:type bench:Article.
?article ?property ?value
(a) FILTER (?property = swrc:pages)
(b) FILTER (?property = swrc:month)
}
Q3. Executing queries Q3a and Q3b requires only one join but generates a
huge amount of intermediate results, since the second triple pattern matches all RDF
triples. However, the output does not contain the filter variable
?property, so the query can be optimized at the algebra level by a filter substitution
in which the variable is replaced by its value. This optimization reduces the execution
time of the query by 70% (e)+(g) due to a significant reduction of the reduce shuffle
bytes (f)+(h). A positive side effect of this optimization is the elimination of the
unbound predicate in the second triple pattern. Thus, with a vertically partitioned
data set, only two predicates must be considered, which significantly reduces the
data read from HDFS (opt+part). Together, the filter optimization and vertical
partitioning reduce the execution time of this query by 97%.
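The filter substitution described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not PigSPARQL's actual implementation: the `TriplePattern` type and the `substitute_filter` function are hypothetical names, and the only safety condition checked is the one given in the text (the filter variable is not part of the query output).

```python
from typing import NamedTuple


class TriplePattern(NamedTuple):
    """Hypothetical representation of one SPARQL triple pattern."""
    subject: str
    predicate: str
    obj: str


def substitute_filter(patterns, filter_var, filter_value, projected):
    """Rewrite FILTER (filter_var = filter_value) into the patterns.

    The rewrite is applied only if filter_var is not projected.
    Returns (patterns, filter_still_needed).
    """
    if filter_var in projected:
        # The variable is part of the output, so keep the filter as is.
        return list(patterns), True
    rewritten = [
        TriplePattern(*(filter_value if term == filter_var else term
                        for term in pattern))
        for pattern in patterns
    ]
    return rewritten, False


# Q3a: SELECT ?article ... FILTER (?property = swrc:pages)
q3a = [
    TriplePattern("?article", "rdf:type", "bench:Article"),
    TriplePattern("?article", "?property", "?value"),
]
optimized, needs_filter = substitute_filter(
    q3a, "?property", "swrc:pages", projected={"?article"}
)
for pattern in optimized:
    print(" ".join(pattern))
```

After the rewrite, the second pattern has a bound predicate (`swrc:pages`), which is what allows a vertically partitioned data set to be read selectively.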
The difference between Q3a and Q3b is the selectivity of the property used in the
filter expression. While the property swrc:pages is rather unselective, as it retains
92.61% of all articles, the property swrc:month retains only 0.62% of all articles
[10]. Yet comparing the query execution times in (e) and (g) shows little
difference: query processing does not exploit the selectivity, because dangling articles
are discarded only in the reduce phase, where the join between the two triple patterns is
actually computed. For this kind of very selective pattern, it would be far more
efficient to discard the dangling mappings already in the map phase, before they are
transferred over the network. We will discuss this in more detail in the following sections.
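To make the cost of late filtering concrete, the following toy simulation counts how many records would be shuffled in a plain reduce-side join versus a join in which the map phase drops dangling articles. It is a sketch under assumptions: the function names are hypothetical, the data is synthetic, and the map-side check uses an in-memory set of subjects (a semi-join style pre-filter), which is one possible technique and not necessarily the one discussed later in the text.

```python
from collections import defaultdict


def _reduce(shuffled):
    """Group by join key and keep keys that appear on both sides."""
    groups = defaultdict(lambda: {"T": [], "P": []})
    for key, (tag, value) in shuffled:
        groups[key][tag].append(value)
    return sorted(k for k, g in groups.items() if g["T"] and g["P"])


def reduce_side_join(type_triples, prop_triples):
    """Plain reduce-side join: every input triple is shuffled."""
    shuffled = [(s, ("T", o)) for s, _, o in type_triples]
    shuffled += [(s, ("P", o)) for s, _, o in prop_triples]
    return _reduce(shuffled), len(shuffled)


def map_side_filtered_join(type_triples, prop_triples):
    """Same join, but dangling articles are dropped before the shuffle."""
    # Assumption: the subjects of the selective pattern fit in memory.
    keep = {s for s, _, _ in prop_triples}
    shuffled = [(s, ("T", o)) for s, _, o in type_triples if s in keep]
    shuffled += [(s, ("P", o)) for s, _, o in prop_triples]
    return _reduce(shuffled), len(shuffled)


# Synthetic data: 1000 articles, only 6 of which have swrc:month.
articles = [(f"article{i}", "rdf:type", "bench:Article") for i in range(1000)]
months = [(f"article{i}", "swrc:month", "5") for i in range(0, 1000, 167)]

plain_result, plain_shuffled = reduce_side_join(articles, months)
early_result, early_shuffled = map_side_filtered_join(articles, months)
print(plain_shuffled, early_shuffled, plain_result == early_result)
```

Both variants return the same articles, but the filtered variant shuffles only the records that can actually find a join partner.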
As an immediate observation, our experiments confirm that query processing time
scales linearly with the size of the data, a well-known feature of
the MapReduce paradigm. This underlines that PigSPARQL indeed is an effective