these constraints are not present in RDF models. The consequence of this mismatch
in assumptions may not be easily observable when dealing with very small data sets
or data sets generated using synthetic benchmarks like LUBM and BSBM that generate data fairly systematically, causing such assumptions to hold artificially. However,
when processing large real data sets, this issue becomes very apparent.
Distributed and Parallel Graph Pattern Query Evaluation. Systems such as
Virtuoso [14] use distributed and parallel processing schemes for scalable RDF query
processing on cluster-based architectures. Other systems use a hybrid of MapReduce
for distributed processing and traditional RDF database systems for processing within
a partition [21]. Some others use only MapReduce processing [22,36,41]. Overall, in
most of these existing systems, data partitioning is done using some form of key-range partitioning [17,18,22,43]. However, [21] uses a structure-based partitioning scheme
that clusters related nodes and edges at a coarser level of granularity—connected
subgraphs. The nature of data partitioning impacts the kind of parallelism that can be
enabled for processing. For example, key-range partitioning on a column of a triple relation allows join operations on that column to be executed with partitioned parallelism, where the same operator is executed across partitions. If there are multiple such joins, that is, a star join, then pipelined parallelism of the join operations that constitute the star join could potentially be supported within each partition. For most existing systems, this is the extent of parallelism that is supported.
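To make this concrete, the following minimal Python sketch (the data layout and function names are our own illustration, not any particular system's) shows partitioned parallelism for a subject-star join over a triple relation hash-partitioned on the subject column: every pattern of the star can be matched inside a single partition, so the same operator runs over all partitions independently and the results are simply unioned.

```python
# Sketch of partitioned parallelism for a star join (hypothetical layout).
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count

def partition_by_subject(triples):
    """Hash-partition the triple relation on its subject column."""
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for s, p, o in triples:
        parts[hash(s) % NUM_PARTITIONS].append((s, p, o))
    return parts

def star_join(partition, predicates):
    """Match subjects having all required predicates (one object per
    predicate, for simplicity of the sketch)."""
    by_subject = defaultdict(dict)
    for s, p, o in partition:
        if p in predicates:
            by_subject[s][p] = o
    return [(s, bindings) for s, bindings in by_subject.items()
            if len(bindings) == len(predicates)]

def parallel_star_join(triples, predicates):
    # All of a subject's triples land in one partition, so each partition
    # is independent; on a cluster, each star_join call would run on a
    # different node (the same operator executed across partitions).
    parts = partition_by_subject(triples)
    return [match for part in parts for match in star_join(part, predicates)]

triples = [("alice", "type", "Person"), ("alice", "knows", "bob"),
           ("bob", "type", "Person")]
print(parallel_star_join(triples, {"type", "knows"}))
# -> [('alice', {'type': 'Person', 'knows': 'bob'})]
```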
The implication is that, from a global perspective, the context of every execution is either a single join or a single star join. This means that when there are multiple star joins, there are multiple partitioning phases, with the potential costs of materializing intermediate results and transporting them across the network. On MapReduce platforms, the cost of each execution phase is more significant than in traditional processing models, so it is more critical to focus on plans with as few phases as possible, that is, shorter execution workflows.
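The sketch below (again hypothetical, with a toy in-memory "shuffle") illustrates that phase boundary: matches of the first star must be materialized and re-partitioned on the variable that links them to the second star before the cross-star join can proceed partition-locally.

```python
# Two star joins force a second partitioning phase: star-1 matches are
# materialized and re-partitioned on the shared variable. On a cluster,
# this boundary is exactly where intermediate results are written out and
# moved over the network.
NUM_PARTITIONS = 4

def repartition(items, key):
    """Re-partition intermediates on a new join key."""
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for item in items:
        parts[hash(key(item)) % NUM_PARTITIONS].append(item)
    return parts

# Phase 1 output (toy data): bindings produced by each star join.
star1 = [("alice", "bob"), ("carol", "dan")]    # ?x knows ?y
star2 = [("bob", "Person"), ("dan", "Person")]  # ?y type ?z

# Shuffle both sides onto the shared variable ?y, then join locally.
p1 = repartition(star1, key=lambda m: m[1])  # ?y is the object side
p2 = repartition(star2, key=lambda m: m[0])  # ?y is the subject side
for part1, part2 in zip(p1, p2):
    lookup = dict(part2)
    for x, y in part1:
        if y in lookup:
            print(x, y, lookup[y])  # prints both joined rows
```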
One may consider using structure-based partitioning schemes such as the one in [21] to achieve coarser-grained partitioning, so that complete matches for large subqueries, or even entire queries, may be found within individual partitions, thus minimizing the number of phases needed.
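As a rough illustration of such a scheme, in the spirit of the approach in [21] but with details that are our own simplification, the sketch below replicates each triple to the home partitions of both of its endpoints, giving a 1-hop guarantee: any subquery whose matches span at most one hop completes inside a single partition, avoiding extra phases at query time.

```python
# Structure-based partitioning with a 1-hop guarantee (simplified sketch).
NUM_PARTITIONS = 4

def one_hop_partition(triples):
    parts = [set() for _ in range(NUM_PARTITIONS)]
    for s, p, o in triples:
        parts[hash(s) % NUM_PARTITIONS].add((s, p, o))  # subject's home
        parts[hash(o) % NUM_PARTITIONS].add((s, p, o))  # object's home
    return parts

parts = one_hop_partition([("a", "knows", "b"), ("b", "knows", "c")])
# ("a","knows","b") is stored in both a's and b's home partitions, so a
# one-hop query around either vertex never crosses partition boundaries.
# The trade-off: replication and partition construction add to the
# preprocessing cost paid before the first query can run.
```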
However, the need for such specialized partitioning only adds to the already significant preprocessing time required by existing systems for constructing large numbers of indexes and statistical profiles (HadoopRDF [22] reported preprocessing times of up to 56 hours for 3.3 billion triples in systems like RDF-3X [30]). Significant preprocessing requirements may not be ideal in elastic, on-demand, leased cloud-usage scenarios, because the financial cost of that time may not be recouped over the short terms for which services are rented. Further, heavy preprocessing delays time-to-first-result, which is critical in exploratory phases where users want a quick sense of the data before deciding exactly what they want to do with it.
In addition, the goal of keeping execution workflows short may conflict with the optimization goals of cost-based techniques developed for non-MapReduce settings, which rely on statistics alone. For example, using relational join cardinality estimation techniques, the plan selected as optimal may be a more linear query plan with a longer MapReduce execution workflow than a plan chosen with the number of MapReduce cycles as the primary optimization goal. This has been echoed in [6,22], and we present an empirical validation in the next section.
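The following toy model (our own simplification, not the cost model of any cited system) makes the tension visible: if each level of a join tree costs one MapReduce cycle, with independent joins at the same level sharing a cycle, then the left-deep plan often favored by cardinality estimates yields a longer workflow than a bushy plan that evaluates both star joins first.

```python
# Counting MapReduce cycles as join-tree depth (a deliberate simplification).
from dataclasses import dataclass

@dataclass
class Join:
    left: object
    right: object

def mr_cycles(plan):
    """Cycles = depth of the join tree: independent joins at the same
    depth run in the same cycle under this model."""
    if not isinstance(plan, Join):
        return 0
    return 1 + max(mr_cycles(plan.left), mr_cycles(plan.right))

# Four triple patterns: t1,t2 form one star join; t3,t4 form another.
linear = Join(Join(Join("t1", "t2"), "t3"), "t4")  # left-deep plan
bushy  = Join(Join("t1", "t2"), Join("t3", "t4"))  # stars evaluated first
print(mr_cycles(linear), mr_cycles(bushy))  # 3 2
```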
To enable fewer MapReduce cycles, it will be useful to investigate other kinds of inter-operator parallelism. HadoopRDF [22] explored this idea by making some