6.3.3 Query Processing Systems on Extended Hadoop Platforms
A variety of Hadoop-based platforms support extensions to the data flow specification layer, the data-processing layer, and the underlying storage models, as shown in Figure 6.3. Systems such as Apache Hive and Pig allow users to express data-processing tasks using high-level query primitives that are automatically compiled into low-level map and reduce functions. Works such as PigSPARQL translate a SPARQL query into Pig's high-level data flow language, Pig Latin, incorporating basic optimizations such as the early application of filters and projections and the rearrangement of triple patterns based on variable counting. Similar to SHARD, this translation produces one JOIN command in Pig for each join in the query. The query compilation process in Pig is described in detail in the next section. Extended platforms with indexed access methods have also been proposed to support efficient random access, which is expensive over HDFS. Hybrid database-Hadoop architectures such as HadoopDB [5] take advantage of available indexes as well as traditional database optimization techniques. HadoopDB employs data partitioning schemes that allow part of the query evaluation to be pushed into the database, thus reducing the required number of MapReduce cycles.
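The effect of such partitioning can be illustrated with a minimal sketch, assuming subject-based hash partitioning of RDF triples (the function names and data layout are illustrative, not HadoopDB's actual API). Because all triples sharing a subject land in the same partition, a subject-subject join can be evaluated entirely inside each partition's local store, and the coordinator only needs to union the partial results:

```python
# Illustrative sketch of partitioned pushdown: triples are hash-partitioned
# by subject, so a join on a shared subject variable runs locally in each
# partition, with no cross-partition shuffle.

def partition_by_subject(triples, n_partitions):
    """Assign each (s, p, o) triple to a partition by hashing its subject."""
    parts = [[] for _ in range(n_partitions)]
    for s, p, o in triples:
        parts[hash(s) % n_partitions].append((s, p, o))
    return parts

def local_subject_join(partition, pred1, pred2):
    """Join the patterns ?s pred1 ?x . ?s pred2 ?y inside one partition."""
    left = {}
    for s, p, o in partition:
        if p == pred1:
            left.setdefault(s, []).append(o)
    results = []
    for s, p, o in partition:
        if p == pred2:
            for x in left.get(s, []):
                results.append((s, x, o))
    return results

def evaluate(triples, pred1, pred2, n_partitions=4):
    # Each partition is joined locally ("pushed into the database");
    # the coordinator only unions the partial results.
    out = []
    for part in partition_by_subject(triples, n_partitions):
        out.extend(local_subject_join(part, pred1, pred2))
    return out
```

The result is independent of the number of partitions precisely because the partitioning key matches the join key; joins on other keys would still require a MapReduce shuffle.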
HadoopDB's storage and processing layer has also been extended [21] to include
RDF-3X [30], an RDF storage and retrieval system for SPARQL query support.
RDF data is partitioned across multiple nodes with some overlap, and most query processing is pushed into the single-node RDF-3X store on each node, which takes the place of the relational database used in the original HadoopDB. Other works [15,33,40] have proposed RDF storage models and
SPARQL query translation algorithms that exploit distributed databases such as
HBase [1]. The MAPSIN [40] join algorithm uses HBase to selectively retrieve mapping values for the variables in a graph pattern, avoiding the need for an expensive reduce phase to execute joins. H2RDF [33] indexes data in HBase together with a statistical profile of the data sets, and its query engine adaptively selects the most suitable join algorithm based on query selectivity and the inherent characteristics of MapReduce and HBase. EAGRE [48] introduces an RDF data representation and layout scheme to efficiently locate RDF triples that match a graph pattern.
Additionally, adaptive scheduling strategies and a consulting protocol are used to
evaluate queries in a way that minimizes disk and network I/O costs, as well as the
total execution time.
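Among the translation optimizations mentioned above, the variable-counting heuristic used in PigSPARQL-style translation is easy to sketch. The code below is an illustrative reconstruction, not PigSPARQL's actual implementation: triple patterns with fewer unbound variables are assumed to be more selective and are therefore evaluated first.

```python
# Illustrative sketch of triple-pattern reordering by variable counting.
# A term starting with '?' is an unbound variable; patterns with fewer
# variables are assumed more selective and are moved to the front.

def count_vars(pattern):
    """Count unbound variables in a (subject, predicate, object) pattern."""
    return sum(1 for term in pattern if term.startswith("?"))

def reorder_by_variable_counting(patterns):
    """Stable sort: most selective (fewest variables) patterns first."""
    return sorted(patterns, key=count_vars)

bgp = [
    ("?s", "?p", "?o"),             # 3 variables: least selective
    ("?s", "rdf:type", "?t"),       # 2 variables
    ("?s", "foaf:name", '"Alice"')  # 1 variable: most selective
]
# reorder_by_variable_counting(bgp) puts the 1-variable pattern first.
```

Evaluating the most selective pattern first shrinks the intermediate results that each subsequent JOIN in the generated Pig Latin script must process; real translators also take bound positions and available statistics into account.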
[Figure 6.3 depicts a three-layer stack: a dataflow specification and compilation layer (Pig, Hive); a data-processing layer (MapReduce, DBMS, HBase); and a data storage layer (Database, HDFS).]
FIGURE 6.3
Hadoop-based data processing platforms.
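The map-side join idea behind MAPSIN, discussed above, can also be sketched. In this illustrative reconstruction a plain dictionary stands in for an HBase table keyed by subject; the point is that compatible bindings are fetched by direct index lookups during the map phase, so no reduce-side shuffle is required.

```python
# Sketch of a MAPSIN-style map-side join (illustrative; a dict stands in
# for an HBase table). Each "row" is keyed by subject and maps a predicate
# to its list of objects, mimicking a column-family layout.

index = {
    "alice": {"knows": ["bob"], "age": ["30"]},
    "bob":   {"age": ["25"]},
}

def mapsin_join(subjects, pred1, pred2):
    """Join ?s pred1 ?x . ?s pred2 ?y via point lookups, map-side only."""
    results = []
    for s in subjects:
        row = index.get(s, {})       # one indexed lookup per input binding
        for x in row.get(pred1, []):
            for y in row.get(pred2, []):
                results.append((s, x, y))
    return results
```

In the real system the lookups are HBase reads issued from map tasks, and their selectivity is what makes skipping the reduce phase profitable.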
 
Search WWH ::




Custom Search