6.3.4 Complementary Optimization Techniques on MapReduce
There have been other compile-time and run-time techniques to optimize join processing on MapReduce. Though some of this work has been pursued outside the context of RDF and SPARQL, it is highly relevant to this discussion. Compile-time optimizations include techniques that share scans or results of subexpressions within [28] or across queries [13,31,35]. A multiway join algorithm [6] has also been proposed that efficiently partitions and replicates tuples across reducers in a way that minimizes the communication cost as well as the required number of MapReduce cycles.
Run-time optimization techniques such as sideways information passing [19,20] have also been adapted for Hadoop platforms. These techniques exploit join information from subqueries to prune input that is irrelevant to subsequent join operations, thus reducing materialization and network transfer costs. Other run-time techniques [25,26] address data skew problems that overload some reducers and degrade overall performance. Rather than the timeout approach used in traditional Hadoop, SkewTune [26] proactively detects skewed jobs and repartitions the remaining unprocessed data across other available nodes.
The problem of skew is expected to be common when processing web-scale RDF data sets, since some subject/property values occur with much higher frequency than others, for example, resources related to the RDF schema. A skew-resistant join algorithm [25] that uses bifocal sampling and replicated join techniques has been proposed for Pig. A comprehensive survey of other available optimization techniques for the MapReduce framework can be found in [27,39].
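Pig Latin itself exposes skew-aware and replication-based joins directly through join hints on the JOIN operator. A minimal sketch, assuming two hypothetical relations of RDF-style pairs (the file paths, aliases, and field names are illustrative, not from the text):

```pig
-- Load two hypothetical relations of (subject, object) pairs.
A = LOAD 'rel_a' AS (s:chararray, o:chararray);
B = LOAD 'rel_b' AS (s:chararray, o:chararray);

-- Skewed join: Pig samples the key distribution and spreads records
-- with heavily popular keys over several reducers instead of sending
-- them all to a single overloaded reducer.
J1 = JOIN A BY s, B BY s USING 'skewed';

-- Replicated join: the second (small) input is broadcast to every map
-- task, so no reduce-side shuffle of the large input is needed.
J2 = JOIN A BY s, B BY s USING 'replicated';
```

The 'replicated' hint only applies when the broadcast input fits in memory; the 'skewed' hint trades a sampling pass for a more balanced reduce phase.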
6.4 SPARQL QUERY COMPILATION ON HADOOP-BASED PLATFORMS—A CASE STUDY ON APACHE PIG
Extended Hadoop-based platforms such as Hive and Pig allow users to express data-processing tasks using high-level query primitives. For example, Pig provides a high-level data flow language called Pig Latin with which users specify their tasks as a sequence of data transformation commands, for example, LOAD, SPLIT, JOIN, etc. A high-level data flow script is automatically translated into a logical plan, a physical plan, and a MapReduce (MR) execution plan.
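These intermediate plans can be inspected with Pig's EXPLAIN statement, which prints the logical, physical, and MapReduce plans generated for an alias (the input path and alias name below are illustrative):

```pig
-- Load a hypothetical triple relation.
triples = LOAD 'rdf_input' AS (s:chararray, p:chararray, o:chararray);

-- Print the logical, physical, and MapReduce execution plans
-- that Pig generates for this alias.
EXPLAIN triples;
```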
Example 6.1: SPARQL Query in Pig Latin
Suppose we have a query with two star patterns (SJ1 and SJ2), each with two triple patterns whose properties are p1, p2 and p3, p4, respectively. A group of Pig Latin commands can be used to evaluate this query: Pig's SPLIT operator vertically partitions the input relation based on properties, and the JOIN operator processes the joins in the query. A Pig Latin version of the query can be expressed as Program 6.1.
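Program 6.1 is not reproduced here, but a sketch of such a script, under the assumption that triples are loaded as (subject, property, object) tuples and that p1 through p4 are literal property values, might look like this:

```pig
-- Load a hypothetical triple relation from HDFS.
triples = LOAD 'rdf_input' AS (s:chararray, p:chararray, o:chararray);

-- Vertically partition the triples by property value.
SPLIT triples INTO r1 IF p == 'p1', r2 IF p == 'p2',
                   r3 IF p == 'p3', r4 IF p == 'p4';

-- Star join SJ1: triples sharing a subject, with properties p1 and p2.
SJ1 = JOIN r1 BY s, r2 BY s;

-- Star join SJ2: triples sharing a subject, with properties p3 and p4.
SJ2 = JOIN r3 BY s, r4 BY s;

-- Join the two stars (the column connecting the stars is query
-- dependent; joining on subject here is purely illustrative).
result = JOIN SJ1 BY r1::s, SJ2 BY r3::s;
```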
6.4.1 Logical Plan Translation
The data flow compilation translates a Pig Latin program into a Pig logical plan. LOLoad is used to load the triple relation from HDFS. The next pair of operators, LOSplit