Single-node RDF systems have integrated various optimization techniques for efficient
query processing, for example, histograms, summary statistics, dictionary mapping,
and sideways information passing
with comprehensive indexes. Furthermore, bushy execution plans and compressed bit-
map data structures have been adopted [8,16,45]. However, as shown in [22], most single-
node systems require huge main memory and a significant loading/preprocessing time.
6.3.1 Distributed RDF Query Processing Systems
Distributed RDF databases [14,17,18,43] have been studied to improve the scal-
ability and fault-tolerance of RDF query processing systems. Generally, they are
also equipped with exhaustive indexes, which are distributed based on key ranges
(e.g., S in the SPO index) or a global hash/randomized function for load balancing.
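As a minimal sketch of the hash-based scheme, the following Python snippet assigns each triple to a node by hashing its subject, the leading key of an SPO index. The cluster size, the sample triples, and the helper names are illustrative assumptions, not any cited system's actual code.

```python
import hashlib

NUM_NODES = 4  # hypothetical cluster size, assumed for illustration

def node_for(key: str, num_nodes: int = NUM_NODES) -> int:
    """Assign an index key to a node with a deterministic hash.

    A uniform hash spreads keys evenly across nodes, which is the
    load-balancing property described above.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_nodes

# In an SPO index the subject is the leading key, so hashing on it
# sends all triples with the same subject to the same node, where
# subject-subject joins can then run locally.
triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:name",  '"Alice"'),
    ("ex:bob",   "foaf:name",  '"Bob"'),
]

partitions = {n: [] for n in range(NUM_NODES)}
for s, p, o in triples:
    partitions[node_for(s)].append((s, p, o))

print(partitions)
```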
However, such systems have typically been tested with a relatively small number of
nodes and limited data set sizes, for example, 16 nodes and approximately 600
GB of data in YARS2 [18]. As an alternative, graph-oriented approaches [42,47]
have been explored to manipulate RDF data sets in a distributed environment.
However, the performance of such systems still relies
on large memory with expensive network devices, for example, 96 GB RAM and
InfiniBand network in Trinity [47]. Recently, extensive efforts have been made to
leverage commodity-grade machines on generic cloud platforms to process massive
RDF data sets. In particular, the MapReduce framework has been widely
used to build scalable RDF processing systems. The main difference across
such systems is the translation strategy from relational plans to MapReduce plans. In
this section, we mainly discuss existing RDF query processing systems and optimi-
zation techniques that have been developed for vanilla Hadoop as well as Hadoop-
based extended platforms.
6.3.2 Query Processing Systems on the Vanilla Hadoop Platform
A common strategy for evaluating RDF graph pattern queries on MapReduce is
translating queries to corresponding relational-style plans, for example, SHARD
[38], PigSPARQL [41], HadoopRDF [22]. SHARD employs a MapReduce execution
plan in which the first MapReduce job preprocesses the data (groups related triples),
while the subsequent MapReduce jobs implement a clause-iteration evaluation tech-
nique that executes each join with a triple pattern in a separate MapReduce job. A
final MapReduce job then projects the specified columns, leading to a total of
(n + 1) cycles for a query with n triple patterns (that is, n − 1 joins). SHARD
offers fairly limited support for the SPARQL language; as of this writing, it does
not support FILTER operations.
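To make the (n + 1)-cycle structure concrete, here is a small Python sketch that simulates a SHARD-style clause-iteration plan in memory: one job binds the first pattern, one join job handles each remaining pattern (n − 1 joins), and a final job projects, giving n + 1 jobs in total. The data set, query, and helper names are illustrative assumptions, not SHARD's actual implementation.

```python
# Each function below stands in for one MapReduce job; its return
# value plays the role of that job's output (variable bindings).

triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob",   "foaf:knows", "ex:carol"),
    ("ex:carol", "foaf:name",  '"Carol"'),
]

def match(pattern):
    """Scan the triples and emit bindings for one triple pattern.
    Terms starting with '?' are variables."""
    out = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                binding[term] = value
            elif term != value:
                break
        else:
            out.append(binding)
    return out

def join(left, right):
    """Join job: merge bindings that agree on shared variables."""
    out = []
    for l in left:
        for r in right:
            if all(l[v] == r[v] for v in l.keys() & r.keys()):
                out.append({**l, **r})
    return out

# Query: ?x knows ?y . ?y knows ?z   -- n = 2 triple patterns
patterns = [("?x", "foaf:knows", "?y"), ("?y", "foaf:knows", "?z")]

bindings = match(patterns[0])                    # job 1
for p in patterns[1:]:                           # jobs 2..n (n - 1 joins)
    bindings = join(bindings, match(p))
result = [(b["?x"], b["?z"]) for b in bindings]  # job n + 1: projection
print(result)  # [('ex:alice', 'ex:carol')]
```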
Some research efforts have developed optimization techniques that attempt to
reduce this cost by minimizing the number of MapReduce jobs required to process
a query [22]. HadoopRDF preprocesses data into a storage model consisting
of files of predicate-object splits of data. This allows for selective retrieval of splits
based on predicates and objects in a query. While queries are interpreted using the
traditional relational-like algebra, HadoopRDF employs a greedy strategy to find
a grouping of "non-conflicting" join operators that can be executed in the same
cycle. This results in a shorter MapReduce workflow than the traditional
one-join-per-cycle approach.
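The following sketch illustrates the greedy grouping idea under a simplifying assumption: two joins are treated as non-conflicting when they touch disjoint sets of triple patterns, so they can share one MapReduce cycle. The conflict test, the example joins, and the function names are hypothetical, not HadoopRDF's exact algorithm.

```python
# Each join is named by its join variable and the set of triple-pattern
# ids it touches. Joins with disjoint pattern sets do not conflict and
# can be packed into the same MapReduce cycle (a simplifying assumption
# made here for illustration).
joins = {
    "?x": {0, 1},   # patterns 0 and 1 join on ?x
    "?y": {2, 3},   # patterns 2 and 3 join on ?y
    "?z": {1, 2},   # patterns 1 and 2 join on ?z
}

def greedy_cycles(joins):
    """Greedily pack joins into as few cycles as possible."""
    cycles = []
    for var, patterns in joins.items():
        for cycle in cycles:
            # a cycle is usable if no already-placed join overlaps
            if all(patterns.isdisjoint(p) for _, p in cycle):
                cycle.append((var, patterns))
                break
        else:
            cycles.append([(var, patterns)])  # open a new cycle
    return cycles

for i, cycle in enumerate(greedy_cycles(joins), start=1):
    print(f"cycle {i}:", [var for var, _ in cycle])
# cycle 1: ['?x', '?y']  -- non-conflicting, run in one job
# cycle 2: ['?z']        -- conflicts with both, needs its own job
```

With three joins, the one-join-per-cycle plan needs three jobs, whereas the greedy grouping above finishes in two, which is the workflow-shortening effect described in the text.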