Ever-increasing amounts of data are being published on the semantic web. The core technologies of the semantic web are the Resource Description Framework (RDF) [1] for representing data in a machine-readable format and SPARQL [2] for querying RDF data. However, querying RDF data sets at web scale is challenging, especially because the evaluation of SPARQL queries usually requires several joins between subsets of the data. At the same time, classical single-machine approaches have reached a point where they can no longer scale with the ever-increasing amount of available RDF data (cf. [3]).
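As a minimal illustration (the vocabulary and data are hypothetical, not taken from the chapter's benchmarks), the following SPARQL query retrieves the city of everyone working for a given employer. Its two triple patterns share the variable ?person, so evaluating the query means matching each pattern against the data set and joining the two intermediate results on ?person:

    PREFIX ex: <http://example.org/>
    SELECT ?person ?city
    WHERE {
      ?person ex:worksFor ex:SomeEmployer .
      ?person ex:livesIn ?city .
    }

At web scale, such joins over large intermediate results dominate the overall query cost.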
The advent of Google's MapReduce programming model [4] in 2004 opened up
new ways for parallel processing of very large data sets distributed over a computer
cluster. Hadoop [5] is the most popular open-source implementation of MapReduce.
In the last few years many companies have built up their own Hadoop infrastructure,
but there are also ready-to-use cloud services like Amazon's Elastic Compute Cloud
(EC2), offering the Hadoop platform as a service (PaaS). Thus, in contrast to spe-
cialized distributed RDF systems like YARS2 [6] or 4store [7], the use of existing
Hadoop MapReduce infrastructures enables scalable, distributed and fault-tolerant
SPARQL processing out-of-the-box without any additional installation or manage-
ment overhead. However, developing on the MapReduce level is still technically
challenging as it requires profound knowledge about how to program and optimize
Hadoop in an appropriate way. Therefore, Yahoo! developed Pig Latin [8], a language for the analysis of large data sets on top of Hadoop that gives the user a simple level of abstraction by providing high-level primitives such as filters and joins.
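To give a flavor of this abstraction level, the following Pig Latin sketch (file names, schema, and predicate values are illustrative assumptions) computes the join from the SPARQL example above with a handful of high-level operators instead of hand-written map and reduce functions:

    -- load RDF triples as (subject, predicate, object) tuples
    triples = LOAD 'rdf/input.nt' USING PigStorage(' ')
              AS (s:chararray, p:chararray, o:chararray);
    -- select the triples matching each of the two patterns
    works = FILTER triples BY p == 'ex:worksFor' AND o == 'ex:SomeEmployer';
    lives = FILTER triples BY p == 'ex:livesIn';
    -- join the intermediate results on the shared subject
    result = JOIN works BY s, lives BY s;
    STORE result INTO 'rdf/output';

Pig compiles such a script into a sequence of MapReduce jobs, so the user never has to touch the MapReduce API directly.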
In the first part of this chapter we describe PigSPARQL [9], a framework that translates full SPARQL 1.0 into Pig Latin and thus allows scalable processing of SPARQL queries on a MapReduce cluster without any additional programming effort. It can be downloaded* and executed on any Hadoop cluster with Apache Pig installed, and it benefits from further developments of Apache Pig without changing a single line of code. The second part of the chapter focuses on an optimized join
technique for selective queries. We present the Map-Side Index Nested Loop Join
(MAPSIN join), which combines the scalable indexing capabilities of the NoSQL
data store HBase with MapReduce for efficient large-scale join processing.
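By a selective query we mean one whose triple patterns restrict the relevant data to a small fraction of the overall set, as in this hypothetical example using the vocabulary from above:

    PREFIX ex: <http://example.org/>
    SELECT ?friend ?email
    WHERE {
      ex:Alice ex:knows ?friend .
      ?friend ex:email ?email .
    }

The first pattern has a bound subject and typically matches only a handful of triples. A conventional reduce-side join would nevertheless shuffle the matches of both patterns across the network; the MAPSIN join instead iterates over the small intermediate result in the map phase and retrieves the join partners for each binding directly from the HBase index, avoiding the shuffle altogether.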
This chapter is structured as follows: Section 5.2 gives an introduction to RDF and SPARQL, and also provides an overview of distributed processing with MapReduce (with a special focus on join computation) and Pig Latin. Section 5.3 describes the translation process from SPARQL queries to Pig Latin programs. Section 5.5 discusses experimental results of the PigSPARQL implementation for the SP²Bench SPARQL benchmark [10]. Section 5.6 presents the RDF storage organization for HBase. Section 5.7 takes another look at join processing with MapReduce and suggests the MAPSIN join technique as a flexible alternative to the commonly used reduce-side join. Section 5.8 demonstrates the effectiveness of this approach for selective queries by comparing the MAPSIN join with PigSPARQL's native joins on a selection of LUBM [11] benchmark queries. Finally, Section 5.9 gives an overview of related work, and Section 5.10 concludes the chapter with a short summary.
* http://dbis.informatik.uni-freiburg.de/PigSPARQL.