modifications to Hadoop. Another approach to achieve this is algebraic optimization: rewriting queries using a new set of operators that can achieve the same goal. Such an approach was presented in [36]. It offers additional advantages over HadoopRDF in the management of intermediate results because it uses a nested data model rather than the classical relational model. In addition, HadoopRDF, like most systems, requires a preprocessing phase.
This chapter will discuss state-of-the-art techniques that have been proposed for RDF data processing on MapReduce, particularly for graph pattern matching queries. It will also cover related techniques that have not yet been applied to RDF query processing but are applicable to it. The chapter will cover algorithms for MapReduce-based physical operators; Hadoop-only and hybrid computational architectures; query compilation; cost models and heuristics for cost-based optimization; data models and algebras for algebraic optimization; and dynamic and adaptive optimization techniques. We will end the chapter with a short discussion of the feasibility of current approaches for more complex SPARQL graph pattern fragments, for example, queries with unbound predicates, optional clauses, or grouping and aggregation.
6.2 DATA PROCESSING USING MapReduce—AN OVERVIEW
Encoding tasks using the MapReduce programming model is done in terms of two
functions:
map(K1, V1) → list(K2, V2)
reduce(K2, list(V2)) → (K2, V3)
where map() processes an input key-value pair (K1, V1) and produces intermediate key-value pairs (K2, V2). The reduce() function aggregates the group of values associated with an intermediate key, such as K2 in this example, to produce one output key-value pair per group. MapReduce execution platforms like Hadoop use a master-slave architecture in which a master node (the JobTracker) splits the input file into “chunks” and schedules m instances of slave nodes, called mappers, so that each is assigned some chunk to process; that is, each executes the map() function on its input, partitions the map output based on the key, and writes the results to its local disk. Subsequently, the JobTracker schedules r instances of reducers, each of which reduces some assigned partition of the mapper output in a process that consists of three phases: copy, in which the sorted map output is copied from the mappers' disks to the reducer nodes; merge, in which the sorted output lists from the different mappers are merged based on the intermediate key; and reduce, in which reduce() is invoked once for each intermediate key and applied to the associated group of values. The output of the reduce phase is saved to the Hadoop Distributed File System (HDFS).
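To make the model concrete, the following is a minimal sketch (not taken from the chapter) of how these signatures translate into Hadoop's Java MapReduce API. It groups RDF triples, read as whitespace-separated N-Triples lines, by their subject: map() emits (subject, "predicate object") pairs, and reduce() concatenates the values that share a subject. The class name GroupTriplesBySubject, the command-line input/output paths, and the simplistic line parsing are illustrative assumptions.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GroupTriplesBySubject {

  // map(K1 = byte offset, V1 = line of text) -> list(K2 = subject, V2 = "predicate object")
  public static class TripleMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Naive N-Triples parsing: subject, predicate, rest of line (object, possibly with trailing '.')
      String[] parts = line.toString().trim().split("\\s+", 3);
      if (parts.length == 3) {
        context.write(new Text(parts[0]), new Text(parts[1] + " " + parts[2]));
      }
    }
  }

  // reduce(K2 = subject, list(V2)) -> (K2, V3 = concatenated predicate/object pairs)
  public static class SubjectReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text subject, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      StringBuilder sb = new StringBuilder();
      for (Text v : values) {
        if (sb.length() > 0) sb.append(" ; ");
        sb.append(v.toString());
      }
      context.write(subject, new Text(sb.toString()));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "group-triples-by-subject");
    job.setJarByClass(GroupTriplesBySubject.class);
    job.setMapperClass(TripleMapper.class);
    job.setReducerClass(SubjectReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS directory of N-Triples files (assumed)
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output is written back to HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

In terms of the execution flow described above, the JobTracker would assign input chunks to mappers running TripleMapper, shuffle the intermediate (subject, value) pairs by subject, and invoke SubjectReducer once per subject group before the results are written to HDFS.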
Remark 6.1
The dominant cost components in a MapReduce cycle include M_Read, the cost of mappers reading the input; data-shuffling costs that involve M_Write, local disk writes at