6.3.4 Complementary Optimization Techniques on MapReduce
There have been other compile-time and run-time techniques to optimize join processing on MapReduce. Though some of this work has been pursued outside the context of RDF and SPARQL, it is highly relevant to this discussion. Compile-time optimizations include techniques that share scans or results of subexpressions within [28] or across queries [13,31,35]. A multiway join algorithm [6] has also been proposed that efficiently partitions and replicates tuples across reducers in a way that minimizes the communication cost as well as the required number of MapReduce cycles.
Run-time optimization techniques such as sideways information passing [19,20] have also been adapted for Hadoop platforms. These techniques exploit join information from subqueries to prune input that is irrelevant to subsequent join operations, thus reducing materialization and network transfer costs. Other run-time techniques [25,26] address data skew problems that overload some reducers and degrade overall performance. Rather than the timeout approach used in traditional Hadoop, SkewTune [26] proactively detects skewed jobs and repartitions the remaining unprocessed data across other available nodes.
The problem of skew is expected to be common when processing web-scale RDF data sets, since some subject/property values occur with much higher frequency than others, for example, resources related to the RDF schema. A skew-resistant join algorithm [25] that uses bifocal sampling and replicated join techniques has been proposed for Pig. A comprehensive survey of other available optimization techniques for the MapReduce framework can be found in [27,39].
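Pig Latin itself exposes skew-aware and replication-based joins directly through join hints on the JOIN operator. A minimal sketch, assuming two hypothetical relations of RDF-style pairs (the file paths, aliases, and field names are illustrative, not from the text):

```pig
-- Load two hypothetical relations of (subject, object) pairs.
A = LOAD 'rel_a' AS (s:chararray, o:chararray);
B = LOAD 'rel_b' AS (s:chararray, o:chararray);

-- Skewed join: Pig samples the key distribution and spreads records
-- with heavily popular keys over several reducers instead of sending
-- them all to a single overloaded reducer.
J1 = JOIN A BY s, B BY s USING 'skewed';

-- Replicated join: the second (small) input is broadcast to every map
-- task, so no reduce-side shuffle of the large input is needed.
J2 = JOIN A BY s, B BY s USING 'replicated';
```

The 'replicated' hint only applies when the broadcast input fits in memory; the 'skewed' hint trades a sampling pass for a more balanced reduce phase.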
6.4 SPARQL QUERY COMPILATION ON HADOOP-BASED PLATFORMS—A CASE STUDY ON APACHE PIG
Extended Hadoop-based platforms such as Hive and Pig allow users to express data-processing tasks using high-level query primitives. For example, Pig provides a high-level data flow language called Pig Latin with which users specify their tasks as a sequence of data transformation commands, for example, LOAD, SPLIT, JOIN, etc. A high-level data flow script is automatically translated into a logical plan, a physical plan, and a MapReduce (MR) execution plan.
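These intermediate plans can be inspected with Pig's EXPLAIN statement, which prints the logical, physical, and MapReduce plans generated for an alias (the input path and alias name below are illustrative):

```pig
-- Load a hypothetical triple relation.
triples = LOAD 'rdf_input' AS (s:chararray, p:chararray, o:chararray);

-- Print the logical, physical, and MapReduce execution plans
-- that Pig generates for this alias.
EXPLAIN triples;
```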
Example 6.1: SPARQL Query in Pig Latin
Suppose we have a query with two star patterns (SJ1 and SJ2), each with two triple patterns whose properties are p1, p2 and p3, p4, respectively. A group of Pig Latin commands can be used to evaluate this query: Pig's SPLIT operator vertically partitions the input relation based on properties, and the JOIN operator processes the joins in the query. A Pig Latin version of the query can be expressed as Program 6.1.
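Program 6.1 is not reproduced here, but a sketch of such a script, under the assumption that triples are loaded as (subject, property, object) tuples and that p1 through p4 are literal property values, might look like this:

```pig
-- Load a hypothetical triple relation from HDFS.
triples = LOAD 'rdf_input' AS (s:chararray, p:chararray, o:chararray);

-- Vertically partition the triples by property value.
SPLIT triples INTO r1 IF p == 'p1', r2 IF p == 'p2',
                   r3 IF p == 'p3', r4 IF p == 'p4';

-- Star join SJ1: triples sharing a subject, with properties p1 and p2.
SJ1 = JOIN r1 BY s, r2 BY s;

-- Star join SJ2: triples sharing a subject, with properties p3 and p4.
SJ2 = JOIN r3 BY s, r4 BY s;

-- Join the two stars (the column connecting the stars is query
-- dependent; joining on subject here is purely illustrative).
result = JOIN SJ1 BY r1::s, SJ2 BY r3::s;
```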
6.4.1 Logical Plan Translation
The data flow compilation translates a Pig Latin program into a Pig logical plan. LOLoad is used to load the triple relation from HDFS. The next pair of operators, LOSplit