these constraints are not present in RDF models. The consequence of this mismatch
in assumptions may not be easily observable when dealing with very small data sets
or data sets generated using synthetic benchmarks like LUBM and BSBM that generate data fairly systematically, causing such assumptions to hold artificially. However,
when processing large real data sets, this issue becomes very apparent.
Distributed and Parallel Graph Pattern Query Evaluation. Systems such as
Virtuoso [14] use distributed and parallel processing schemes for scalable RDF query
processing on cluster-based architectures. Other systems use a hybrid of MapReduce
for distributed processing and traditional RDF database systems for processing within
a partition [21]. Some others use only MapReduce processing [22,36,41]. Overall, in
most of these existing systems, data partitioning is done using some form of key-range partitioning [17,18,22,43]. However, [21] uses a structure-based partitioning scheme
that clusters related nodes and edges at a coarser level of granularity—connected
subgraphs. The nature of data partitioning impacts the kind of parallelism that can be
enabled for processing. For example, key-range partitioning on a column of a triple relation allows join operations on that column to be executed with partitioned parallelism, where the same operator is executed across partitions. If there are multiple such joins, that is, a star join, then pipelined parallelism of the join operations that constitute the star join could potentially be supported within each partition. For most existing systems, this is the extent of parallelism that is supported.
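To make this concrete, the following minimal Python sketch (the data layout and function names are our own illustration, not any particular system's) shows partitioned parallelism for a subject-star join over a triple relation hash-partitioned on the subject column: every pattern of the star can be matched inside a single partition, so the same operator runs over all partitions independently and the results are simply unioned.

```python
# Sketch of partitioned parallelism for a star join (hypothetical layout).
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count

def partition_by_subject(triples):
    """Hash-partition the triple relation on its subject column."""
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for s, p, o in triples:
        parts[hash(s) % NUM_PARTITIONS].append((s, p, o))
    return parts

def star_join(partition, predicates):
    """Match subjects having all required predicates (one object per
    predicate, for simplicity of the sketch)."""
    by_subject = defaultdict(dict)
    for s, p, o in partition:
        if p in predicates:
            by_subject[s][p] = o
    return [(s, bindings) for s, bindings in by_subject.items()
            if len(bindings) == len(predicates)]

def parallel_star_join(triples, predicates):
    # All of a subject's triples land in one partition, so each partition
    # is independent; on a cluster, each star_join call would run on a
    # different node (the same operator executed across partitions).
    parts = partition_by_subject(triples)
    return [match for part in parts for match in star_join(part, predicates)]

triples = [("alice", "type", "Person"), ("alice", "knows", "bob"),
           ("bob", "type", "Person")]
print(parallel_star_join(triples, {"type", "knows"}))
# -> [('alice', {'type': 'Person', 'knows': 'bob'})]
```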
The implication is that, from a global perspective, the context of every execution is either a single join or a single star join. This means that when there are multiple star joins, there are multiple partitioning phases, with the potential costs of materializing intermediate results and transporting them across the network. On MapReduce platforms, the cost of each execution phase is more significant than in traditional processing models, so it is more critical to focus on plans with as few phases as possible, that is, shorter execution workflows.
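The sketch below (again hypothetical, with a toy in-memory "shuffle") illustrates that phase boundary: matches of the first star must be materialized and re-partitioned on the variable that links them to the second star before the cross-star join can proceed partition-locally.

```python
# Two star joins force a second partitioning phase: star-1 matches are
# materialized and re-partitioned on the shared variable. On a cluster,
# this boundary is exactly where intermediate results are written out and
# moved over the network.
NUM_PARTITIONS = 4

def repartition(items, key):
    """Re-partition intermediates on a new join key."""
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for item in items:
        parts[hash(key(item)) % NUM_PARTITIONS].append(item)
    return parts

# Phase 1 output (toy data): bindings produced by each star join.
star1 = [("alice", "bob"), ("carol", "dan")]    # ?x knows ?y
star2 = [("bob", "Person"), ("dan", "Person")]  # ?y type ?z

# Shuffle both sides onto the shared variable ?y, then join locally.
p1 = repartition(star1, key=lambda m: m[1])  # ?y is the object side
p2 = repartition(star2, key=lambda m: m[0])  # ?y is the subject side
for part1, part2 in zip(p1, p2):
    lookup = dict(part2)
    for x, y in part1:
        if y in lookup:
            print(x, y, lookup[y])  # prints both joined rows
```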
One may consider using structure-based partitioning schemes such as the one in [21] to achieve coarser-grained partitioning, so that complete matches for large subqueries, or even entire queries, may be found within individual partitions, thus minimizing the number of phases needed.
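As a rough illustration of such a scheme, in the spirit of the approach in [21] but with details that are our own simplification, the sketch below replicates each triple to the home partitions of both of its endpoints, giving a 1-hop guarantee: any subquery whose matches span at most one hop completes inside a single partition, avoiding extra phases at query time.

```python
# Structure-based partitioning with a 1-hop guarantee (simplified sketch).
NUM_PARTITIONS = 4

def one_hop_partition(triples):
    parts = [set() for _ in range(NUM_PARTITIONS)]
    for s, p, o in triples:
        parts[hash(s) % NUM_PARTITIONS].add((s, p, o))  # subject's home
        parts[hash(o) % NUM_PARTITIONS].add((s, p, o))  # object's home
    return parts

parts = one_hop_partition([("a", "knows", "b"), ("b", "knows", "c")])
# ("a","knows","b") is stored in both a's and b's home partitions, so a
# one-hop query around either vertex never crosses partition boundaries.
# The trade-off: replication and partition construction add to the
# preprocessing cost paid before the first query can run.
```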
However, the need for such specialized partitioning only adds to the already significant preprocessing time required by existing systems for constructing large numbers of indexes and statistical profiles (HadoopRDF [22] reported preprocessing times of up to 56 hours for 3.3 billion triples in systems like RDF-3X [30]). Significant preprocessing requirements may not be ideal in elastic, on-demand, leased cloud-usage scenarios, because the financial cost of that time may not be recouped over the short terms for which services are rented. Further, heavy preprocessing delays time-to-first-result, which is critical in exploratory phases where users want a quick sense of the data before deciding exactly what they want to do with it.
In addition, the goal of keeping execution workflows short may conflict with the optimization goals of cost-based techniques developed for non-MapReduce settings, which rely on statistics alone. For example, using relational join cardinality estimation techniques, the plan selected as optimal may be a more linear query plan with a longer MapReduce execution workflow than a plan chosen with the number of MapReduce cycles as the primary optimization goal. This has been echoed in [6,22], and we present an empirical validation in the next section.
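The following toy model (our own simplification, not the cost model of any cited system) makes the tension visible: if each level of a join tree costs one MapReduce cycle, with independent joins at the same level sharing a cycle, then the left-deep plan often favored by cardinality estimates yields a longer workflow than a bushy plan that evaluates both star joins first.

```python
# Counting MapReduce cycles as join-tree depth (a deliberate simplification).
from dataclasses import dataclass

@dataclass
class Join:
    left: object
    right: object

def mr_cycles(plan):
    """Cycles = depth of the join tree: independent joins at the same
    depth run in the same cycle under this model."""
    if not isinstance(plan, Join):
        return 0
    return 1 + max(mr_cycles(plan.left), mr_cycles(plan.right))

# Four triple patterns: t1,t2 form one star join; t3,t4 form another.
linear = Join(Join(Join("t1", "t2"), "t3"), "t4")  # left-deep plan
bushy  = Join(Join("t1", "t2"), Join("t3", "t4"))  # stars evaluated first
print(mr_cycles(linear), mr_cycles(bushy))  # 3 2
```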
To enable fewer MapReduce cycles, it will be useful to investigate other kinds of inter-operator parallelism. HadoopRDF [22] explored this idea by making some