Large-Scale RDF Processing with MapReduce - Large Scale and Big Data: Processing and Management


  ?author foaf:name ?name
  OPTIONAL {
    ?class2 rdfs:subClassOf foaf:Document.
    ?doc2 rdf:type ?class2.
    ?doc2 dcterms:issued ?yr2.
    ?doc2 dc:creator ?author2
    FILTER (?author=?author2 && ?yr2 < ?yr)
  } FILTER (!bound(?author2))
}

Note that diagram (d) of Figure 5.5 refers to 800 million RDF triples, so that executions
with and without vertical partitioning can be compared, and to 1600 million triples
for comparison with the other queries.

Q3. Select All Articles with Property (a) swrc:pages (b) swrc:month

SELECT ?article
WHERE {
  ?article rdf:type bench:Article.
  ?article ?property ?value
  (a) FILTER (?property = swrc:pages)
  (b) FILTER (?property = swrc:month)
}

Q3. The execution of queries Q3a and Q3b requires only one join but generates a
huge amount of intermediate results, since the second triple pattern matches all RDF
triples. However, the output does not contain the filter variable ?property, so the
query can be optimized at the algebra level by a filter substitution in which the
variable is replaced by its value. This optimization reduces the execution time of
this query by 70% (e)+(g) due to a significant reduction of the reduce shuffle
bytes (f)+(h). A positive side effect of this optimization is the elimination of the
unbound predicate in the second triple pattern. Thus, using a vertically partitioned
data set, only two predicates must be considered, which results in a significant
reduction of data read from HDFS (opt+part). Together, the filter optimization and
the vertical partitioning reduce the execution time of this query by 97%.
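
The filter substitution described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PigSPARQL implementation: triple patterns are modeled as plain 3-tuples of strings, variables are recognized by a leading "?", and the helper names substitute_filter and predicates_to_read are hypothetical.

```python
from typing import List, Optional, Tuple

# Assumption: a triple pattern is (subject, predicate, object); variables start with "?".
Triple = Tuple[str, str, str]


def substitute_filter(
    patterns: List[Triple],
    filter_var: str,
    filter_value: str,
    projected: List[str],
) -> Optional[List[Triple]]:
    """Replace filter_var by filter_value in every triple pattern.

    Returns None when the variable is projected, because the rewrite would
    then remove a binding that the query output still needs.
    """
    if filter_var in projected:
        return None
    return [
        tuple(filter_value if term == filter_var else term for term in pattern)
        for pattern in patterns
    ]


def predicates_to_read(patterns: List[Triple]) -> Optional[set]:
    """Predicate partitions that a vertically partitioned store has to scan.

    None means at least one pattern has an unbound predicate, so every
    partition has to be read.
    """
    predicates = set()
    for _, predicate, _ in patterns:
        if predicate.startswith("?"):
            return None
        predicates.add(predicate)
    return predicates


# The two triple patterns of Q3a; FILTER (?property = swrc:pages) is handled separately.
q3a = [
    ("?article", "rdf:type", "bench:Article"),
    ("?article", "?property", "?value"),
]

rewritten = substitute_filter(q3a, "?property", "swrc:pages", projected=["?article"])
```

Before the rewrite, predicates_to_read(q3a) yields None because the second pattern has an unbound predicate. After the rewrite, only the rdf:type and swrc:pages partitions remain, which corresponds to the two predicates mentioned in the text.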

The difference between Q3a and Q3b is the selectivity of the property used in the
filter expression. While the property swrc:pages is rather unselective, as it retains
92.61% of all articles, the property swrc:month retains only 0.62% of all articles
[10]. However, when we compare the query execution times in (e) and (g), there is not
much difference: the query processing does not really exploit this fact, because
dangling articles are discarded only in the reduce phase, where the join between the
two triple patterns is actually computed. For this kind of very selective pattern, it
would be far more efficient to discard the dangling mappings already in the map phase,
before they are transferred over the network. We will discuss this in more detail in
the following sections.
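
To make the cost difference concrete, the following Python sketch simulates both strategies on a toy data set and counts how many records would cross the network. It is a simplified, single-process illustration and not the technique of the following sections: the function names are hypothetical, and the map-side variant assumes that the set of subjects carrying the selective property is small enough to be made available to every mapper (a broadcast-style semi-join).

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def reduce_side_join(triples: Iterable[Triple], prop: str) -> Tuple[List[str], int]:
    """Join articles with `prop` in the reduce phase.

    Returns the matching articles and the number of shuffled records.
    """
    shuffled = defaultdict(list)  # join key (subject) -> tags of the emitted records
    count = 0
    for s, p, o in triples:  # map phase: every candidate record is emitted
        if p == "rdf:type" and o == "bench:Article":
            shuffled[s].append("type")
            count += 1
        elif p == prop:
            shuffled[s].append("prop")
            count += 1
    # Reduce phase: dangling articles are discarded only here.
    result = sorted(
        s for s, tags in shuffled.items() if "type" in tags and "prop" in tags
    )
    return result, count


def map_side_filtered_join(triples: Iterable[Triple], prop: str) -> Tuple[List[str], int]:
    """Discard dangling articles in the map phase using a precomputed key set."""
    triples = list(triples)
    # Assumption: the subjects carrying `prop` fit into memory on every mapper.
    keys = {s for s, p, _ in triples if p == prop}
    result = set()
    count = 0
    for s, p, o in triples:  # map phase: only articles with a join partner are emitted
        if p == "rdf:type" and o == "bench:Article" and s in keys:
            result.add(s)
            count += 1
    return sorted(result), count


# Toy data: three articles, only one of which has swrc:month.
data = [
    ("a1", "rdf:type", "bench:Article"),
    ("a2", "rdf:type", "bench:Article"),
    ("a3", "rdf:type", "bench:Article"),
    ("a2", "swrc:month", "4"),
    ("a1", "dc:title", "t"),
]

late = reduce_side_join(data, "swrc:month")
early = map_side_filtered_join(data, "swrc:month")
```

Both variants return the same article, but the reduce-side join shuffles four records while the map-side variant emits one. The more selective the property, the larger this gap becomes, which is why the selectivity of swrc:month goes unexploited when dangling articles are dropped only in the reduce phase.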

As an immediate observation, our experiments confirm linear scalability of the
query processing time with respect to the size of the data, a well-known feature of
the MapReduce paradigm. This underlines that PigSPARQL is indeed an effective
