modifications to Hadoop. Another approach to achieve this is algebraic optimization: rewriting queries using a new set of operators that can achieve the same goal. Such an approach was presented in [36]. It offers additional advantages over HadoopRDF in the management of intermediate results because it uses a nested data model rather than the classical relational model. In addition, HadoopRDF, like most systems, requires a preprocessing phase.
This chapter will discuss state-of-the-art techniques that have been proposed for RDF data processing on MapReduce, particularly for graph pattern matching queries. It will also cover related techniques that have not yet been applied to RDF query processing but are applicable to it. The chapter will cover algorithms for MapReduce-based physical operators; Hadoop-only and hybrid computational architectures; query compilation; cost models and heuristics for cost-based optimization; data models and algebras for algebraic optimization; and dynamic and adaptive optimization techniques. We will end the chapter with a short discussion of the feasibility of current approaches for more complex SPARQL graph pattern fragments, for example, queries with unbound predicates, optional clauses, or grouping and aggregation.
6.2 DATA PROCESSING USING MapReduce—AN OVERVIEW
Encoding tasks using the MapReduce programming model is done in terms of two
functions:
map(K1, V1) → list(K2, V2)
reduce(K2, list(V2)) → (K2, V3)
where map() processes an input key-value pair (K1, V1) and produces intermediate key-value pairs (K2, V2). The reduce() function aggregates the group of values associated with an intermediate key, such as K2 in this example, to produce one output key-value pair per group. MapReduce execution platforms like Hadoop use a master-slave architecture in which a master node (the JobTracker) splits the input file into “chunks” and schedules m instances of slave nodes, called mappers, so that each is assigned some chunk to process; that is, each executes the map() function on its input, partitions the map output based on the key, and writes the results to its local disk. Subsequently, the JobTracker schedules r instances of reducers, each of which reduces some assigned partition of the mapper output in a process that consists of three phases: copy, in which the sorted map output is copied from the mappers' disks to the reducer nodes; merge, in which the sorted output lists from the different mappers are merged based on the intermediate key; and reduce, in which reduce() is invoked once for each intermediate key and applied to the associated group of values. The output of the reduce phase is saved to the Hadoop Distributed File System (HDFS).
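To make the model concrete, the following is a minimal sketch (not taken from the chapter) of how these signatures translate into Hadoop's Java MapReduce API. It groups RDF triples, read as whitespace-separated N-Triples lines, by their subject: map() emits (subject, "predicate object") pairs, and reduce() concatenates the values that share a subject. The class name GroupTriplesBySubject, the command-line input/output paths, and the simplistic line parsing are illustrative assumptions.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GroupTriplesBySubject {

  // map(K1 = byte offset, V1 = line of text) -> list(K2 = subject, V2 = "predicate object")
  public static class TripleMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Naive N-Triples parsing: subject, predicate, rest of line (object, possibly with trailing '.')
      String[] parts = line.toString().trim().split("\\s+", 3);
      if (parts.length == 3) {
        context.write(new Text(parts[0]), new Text(parts[1] + " " + parts[2]));
      }
    }
  }

  // reduce(K2 = subject, list(V2)) -> (K2, V3 = concatenated predicate/object pairs)
  public static class SubjectReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text subject, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      StringBuilder sb = new StringBuilder();
      for (Text v : values) {
        if (sb.length() > 0) sb.append(" ; ");
        sb.append(v.toString());
      }
      context.write(subject, new Text(sb.toString()));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "group-triples-by-subject");
    job.setJarByClass(GroupTriplesBySubject.class);
    job.setMapperClass(TripleMapper.class);
    job.setReducerClass(SubjectReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS directory of N-Triples files (assumed)
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output is written back to HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

In terms of the execution flow described above, the JobTracker would assign input chunks to mappers running TripleMapper, shuffle the intermediate (subject, value) pairs by subject, and invoke SubjectReducer once per subject group before the results are written to HDFS.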
Remark 6.1
The dominant cost components in a MapReduce cycle include M_Read, the cost of mappers reading the input; data-shuffling costs that involve M_Write, local disk writes at