Ever-increasing amounts of data are being published on the semantic web. The core technologies of the semantic web are the Resource Description Framework (RDF) [1] for representing data in a machine-readable format and SPARQL [2] for querying RDF data. However, querying RDF data sets at web scale is challenging, especially because the evaluation of SPARQL queries usually requires several joins between subsets of the data. At the same time, classical single-machine approaches have reached a point where they can no longer scale with the ever-increasing amount of available RDF data (cf. [3]).
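As a minimal illustration (the vocabulary and data are hypothetical, not taken from the chapter's benchmarks), the following SPARQL query retrieves the city of everyone working for a given employer. Its two triple patterns share the variable ?person, so evaluating the query means matching each pattern against the data set and joining the two intermediate results on ?person:

    PREFIX ex: <http://example.org/>
    SELECT ?person ?city
    WHERE {
      ?person ex:worksFor ex:SomeEmployer .
      ?person ex:livesIn ?city .
    }

At web scale, such joins over large intermediate results dominate the overall query cost.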
The advent of Google's MapReduce programming model [4] in 2004 opened up
new ways for parallel processing of very large data sets distributed over a computer
cluster. Hadoop [5] is the most popular open-source implementation of MapReduce.
In the last few years many companies have built up their own Hadoop infrastructure,
but there are also ready-to-use cloud services like Amazon's Elastic Compute Cloud
(EC2), offering the Hadoop platform as a service (PaaS). Thus, in contrast to spe-
cialized distributed RDF systems like YARS2 [6] or 4store [7], the use of existing
Hadoop MapReduce infrastructures enables scalable, distributed and fault-tolerant
SPARQL processing out-of-the-box without any additional installation or manage-
ment overhead. However, developing on the MapReduce level is still technically
challenging as it requires profound knowledge about how to program and optimize
Hadoop in an appropriate way. Therefore, Yahoo! developed Pig Latin [8], a language for the analysis of large data sets on top of Hadoop that gives the user a simple level of abstraction by providing high-level primitives such as filters and joins.
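To give a flavor of this abstraction level, the following Pig Latin sketch (file names, schema, and predicate values are illustrative assumptions) computes the join from the SPARQL example above with a handful of high-level operators instead of hand-written map and reduce functions:

    -- load RDF triples as (subject, predicate, object) tuples
    triples = LOAD 'rdf/input.nt' USING PigStorage(' ')
              AS (s:chararray, p:chararray, o:chararray);
    -- select the triples matching each of the two patterns
    works = FILTER triples BY p == 'ex:worksFor' AND o == 'ex:SomeEmployer';
    lives = FILTER triples BY p == 'ex:livesIn';
    -- join the intermediate results on the shared subject
    result = JOIN works BY s, lives BY s;
    STORE result INTO 'rdf/output';

Pig compiles such a script into a sequence of MapReduce jobs, so the user never has to touch the MapReduce API directly.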
In the first part of this chapter we describe PigSPARQL [9], a framework that translates full SPARQL 1.0 into Pig Latin and thus allows scalable processing of SPARQL queries on a MapReduce cluster without any additional programming effort. It can be downloaded* and executed on any Hadoop cluster with Apache Pig installed, and it benefits from further developments of Apache Pig without changing a single line of code. The second part of the chapter focuses on an optimized join
technique for selective queries. We present the Map-Side Index Nested Loop Join
(MAPSIN join), which combines the scalable indexing capabilities of the NoSQL
data store HBase with MapReduce for efficient large-scale join processing.
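By a selective query we mean one whose triple patterns restrict the relevant data to a small fraction of the overall set, as in this hypothetical example using the vocabulary from above:

    PREFIX ex: <http://example.org/>
    SELECT ?friend ?email
    WHERE {
      ex:Alice ex:knows ?friend .
      ?friend ex:email ?email .
    }

The first pattern has a bound subject and typically matches only a handful of triples. A conventional reduce-side join would nevertheless shuffle the matches of both patterns across the network; the MAPSIN join instead iterates over the small intermediate result in the map phase and retrieves the join partners for each binding directly from the HBase index, avoiding the shuffle altogether.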
This chapter is structured as follows: Section 5.2 gives an introduction to RDF and SPARQL, and also provides an overview of distributed processing with MapReduce (with a special focus on join computation) and Pig Latin. Section 5.3 describes the translation process from SPARQL queries to Pig Latin programs. Section 5.5 discusses experimental results of the PigSPARQL implementation for the SP²Bench SPARQL benchmark [10]. Section 5.6 presents the RDF storage organization for HBase. Section 5.7 takes another look at join processing with MapReduce and suggests the MAPSIN join technique as a flexible alternative to the commonly used reduce-side join. Section 5.8 demonstrates the effectiveness of this approach for selective queries by comparing the MAPSIN join with PigSPARQL's native joins on a selection of LUBM [11] benchmark queries. Finally, Section 5.9 gives an overview of related work, and Section 5.10 concludes the chapter with a short summary.
* http://dbis.informatik.uni-freiburg.de/PigSPARQL.