Single-node RDF systems have integrated various optimization techniques for efficient
query processing, for example, histograms, summary statistics, dictionary mapping,
and sideways information passing
with comprehensive indexes. Furthermore, bushy execution plans and compressed bit-
map data structures have been adopted [8,16,45]. However, as shown in [22], most single-
node systems require huge main memory and a significant loading/preprocessing time.
6.3.1 Distributed RDF Query Processing Systems
Distributed RDF databases [14,17,18,43] have been studied to improve the scal-
ability and fault-tolerance of RDF query processing systems. Generally, they are
also equipped with exhaustive indexes, which are distributed based on key ranges
(e.g., S in the SPO index) or a global hash/randomized function for load balancing.
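As a minimal sketch of the hash-based scheme, the following Python snippet assigns each triple to a node by hashing its subject, the leading key of an SPO index. The cluster size, the sample triples, and the helper names are illustrative assumptions, not any cited system's actual code.

```python
import hashlib

NUM_NODES = 4  # hypothetical cluster size, assumed for illustration

def node_for(key: str, num_nodes: int = NUM_NODES) -> int:
    """Assign an index key to a node with a deterministic hash.

    A uniform hash spreads keys evenly across nodes, which is the
    load-balancing property described above.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_nodes

# In an SPO index the subject is the leading key, so hashing on it
# sends all triples with the same subject to the same node, where
# subject-subject joins can then run locally.
triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:name",  '"Alice"'),
    ("ex:bob",   "foaf:name",  '"Bob"'),
]

partitions = {n: [] for n in range(NUM_NODES)}
for s, p, o in triples:
    partitions[node_for(s)].append((s, p, o))

print(partitions)
```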
However, such systems have typically been tested with a relatively small number of
nodes and limited data set sizes, for example, 16 nodes and approximately 600
GB of data in YARS2 [18]. As an alternative, graph-oriented approaches [42,47]
have been explored to manipulate RDF data sets in a distributed environment.
However, the performance of such systems still relies
on large memory with expensive network devices, for example, 96 GB RAM and
InfiniBand network in Trinity [47]. Recently, extensive efforts have been made to
leverage commodity-grade machines on generic cloud platforms to process massive
RDF data sets. In particular, the MapReduce framework has been widely
used to build scalable RDF processing systems. The main difference across
such systems is the translation strategy from relational plans to MapReduce plans. In
this section, we mainly discuss existing RDF query processing systems and optimi-
zation techniques that have been developed for vanilla Hadoop as well as Hadoop-
based extended platforms.
6.3.2 Query Processing Systems on the Vanilla Hadoop Platform
A common strategy for evaluating RDF graph pattern queries on MapReduce is
translating queries to corresponding relational-style plans, for example, SHARD
[38], PigSPARQL [41], HadoopRDF [22]. SHARD employs a MapReduce execution
plan in which the first MapReduce job preprocesses the data (groups related triples),
while the subsequent MapReduce jobs implement a clause-iteration evaluation tech-
nique that executes each join with a triple pattern in a separate MapReduce job. A
final MapReduce job then projects the specified columns, leading to a total of
(n + 1) cycles for a query with n triple patterns (that is, n − 1 joins). SHARD
offers fairly limited support for the SPARQL language; as of this writing, it does
not support FILTER operations.
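To make the (n + 1)-cycle structure concrete, here is a small Python sketch that simulates a SHARD-style clause-iteration plan in memory: one job binds the first pattern, one join job handles each remaining pattern (n − 1 joins), and a final job projects, giving n + 1 jobs in total. The data set, query, and helper names are illustrative assumptions, not SHARD's actual implementation.

```python
# Each function below stands in for one MapReduce job; its return
# value plays the role of that job's output (variable bindings).

triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob",   "foaf:knows", "ex:carol"),
    ("ex:carol", "foaf:name",  '"Carol"'),
]

def match(pattern):
    """Scan the triples and emit bindings for one triple pattern.
    Terms starting with '?' are variables."""
    out = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                binding[term] = value
            elif term != value:
                break
        else:
            out.append(binding)
    return out

def join(left, right):
    """Join job: merge bindings that agree on shared variables."""
    out = []
    for l in left:
        for r in right:
            if all(l[v] == r[v] for v in l.keys() & r.keys()):
                out.append({**l, **r})
    return out

# Query: ?x knows ?y . ?y knows ?z   -- n = 2 triple patterns
patterns = [("?x", "foaf:knows", "?y"), ("?y", "foaf:knows", "?z")]

bindings = match(patterns[0])                    # job 1
for p in patterns[1:]:                           # jobs 2..n (n - 1 joins)
    bindings = join(bindings, match(p))
result = [(b["?x"], b["?z"]) for b in bindings]  # job n + 1: projection
print(result)  # [('ex:alice', 'ex:carol')]
```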
Some research efforts have developed optimization techniques that attempt to
reduce this cost by minimizing the number of MapReduce jobs required to process
a query [22]. HadoopRDF preprocesses data into a storage model consisting
of files of predicate-object splits of data. This allows for selective retrieval of splits
based on predicates and objects in a query. While queries are interpreted using the
traditional relational-like algebra, HadoopRDF employs a greedy strategy to find
a grouping of "non-conflicting" join operators that can be executed in the same
cycle. This results in a shorter MapReduce workflow than the traditional
one-join-per-cycle approach.
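The following sketch illustrates the greedy grouping idea under a simplifying assumption: two joins are treated as non-conflicting when they touch disjoint sets of triple patterns, so they can share one MapReduce cycle. The conflict test, the example joins, and the function names are hypothetical, not HadoopRDF's exact algorithm.

```python
# Each join is named by its join variable and the set of triple-pattern
# ids it touches. Joins with disjoint pattern sets do not conflict and
# can be packed into the same MapReduce cycle (a simplifying assumption
# made here for illustration).
joins = {
    "?x": {0, 1},   # patterns 0 and 1 join on ?x
    "?y": {2, 3},   # patterns 2 and 3 join on ?y
    "?z": {1, 2},   # patterns 1 and 2 join on ?z
}

def greedy_cycles(joins):
    """Greedily pack joins into as few cycles as possible."""
    cycles = []
    for var, patterns in joins.items():
        for cycle in cycles:
            # a cycle is usable if no already-placed join overlaps
            if all(patterns.isdisjoint(p) for _, p in cycle):
                cycle.append((var, patterns))
                break
        else:
            cycles.append([(var, patterns)])  # open a new cycle
    return cycles

for i, cycle in enumerate(greedy_cycles(joins), start=1):
    print(f"cycle {i}:", [var for var, _ in cycle])
# cycle 1: ['?x', '?y']  -- non-conflicting, run in one job
# cycle 2: ['?z']        -- conflicts with both, needs its own job
```

With three joins, the one-join-per-cycle plan needs three jobs, whereas the greedy grouping above finishes in two, which is the workflow-shortening effect described in the text.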