proposed an algorithm that generates a query plan whose cost is bounded by the log
of the total number of variables in the given SPARQL query. An approach for opti-
mizing RDF graph pattern matching by reinterpreting certain join tree structures as
grouping operations has been presented in [81,117]. The proposed approach represents the intermediate results as sets of groups of triples called TripleGroups and
uses Nested TripleGroup Algebra for manipulating them. Abouzied et al. [4] have
demonstrated an approach for storing and querying RDF data using the HadoopDB
system in conjunction with a column-oriented database [2] that can provide a prom-
ising solution for supporting efficient and scalable semantic web applications. A
similar approach has been presented in [71], which replaces the column-oriented backend database with RDF-3X [106], a state-of-the-art RDF query processor.
Surfer [34] is a large-scale graph-processing engine that is designed to provide
two basic primitives for programmers: MapReduce and Propagation. In this engine,
MapReduce processes different key-value pairs in parallel, and propagation is an iterative computational pattern that transfers information along the edges from a vertex
to its neighbors in the graph. In principle, these two primitives are complementary
in graph processing where MapReduce is suitable for processing flat data structures
(e.g., vertex-oriented tasks) while propagation is optimized for edge-oriented tasks
on partitioned graphs. Lattanzi et al. [87] presented a set of MapReduce-based algo-
rithms for a variety of fundamental graph problems such as minimum spanning
trees, maximal matchings, approximate weighted matchings, approximate vertex
and edge covers, and minimum cuts. All of the presented algorithms are parameterized by the amount of memory available per machine, which determines the number of MapReduce rounds.
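As a generic illustration of this style (a minimal Python sketch, not one of the algorithms from [87]; the function names are invented for the example), a single MapReduce round over an edge list can compute the degree of every vertex: the map phase emits a key-value pair per edge endpoint, and the reduce phase sums the pairs grouped by vertex key.

```python
from collections import defaultdict

def map_phase(edges):
    """Map: emit (vertex, 1) for each endpoint of every edge."""
    for u, v in edges:
        yield (u, 1)
        yield (v, 1)

def reduce_phase(pairs):
    """Reduce: group the emitted pairs by key and sum the counts."""
    groups = defaultdict(int)
    for key, count in pairs:
        groups[key] += count
    return dict(groups)

def vertex_degrees(edges):
    """One MapReduce round computing the degree of every vertex."""
    return reduce_phase(map_phase(edges))

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
print(vertex_degrees(edges))  # {'a': 2, 'b': 2, 'c': 3, 'd': 1}
```

A chained graph algorithm would repeat such rounds, feeding the reduce output of one round into the map phase of the next, which is exactly the pattern whose overheads the following paragraph discusses.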
In general, graph algorithms can be written as a series of chained MapReduce
invocations that requires passing the entire state of the graph from one stage to the
next. However, this approach is ill-suited for graph processing and can lead to sub-
optimal performance due to the additional communication and associated serialization
overhead, in addition to the need to coordinate the steps of a chained MapReduce.
The Pregel system [98] has been introduced by Google as a scalable platform for implementing graph algorithms. It relies on a vertex-centric approach, which is inspired
by the Bulk Synchronous Parallel model (BSP) [127], where programs are expressed
as a sequence of iterations, in each of which a vertex can receive messages sent in
the previous iteration, send messages to other vertices, and modify its own state as
well as that of its outgoing edges or mutate graph topology. In particular, Pregel
computations consist of a sequence of iterations, called supersteps. During a superstep the framework invokes a user-defined function for each vertex, conceptually in
parallel, which specifies the behavior at a single vertex V and a single superstep S. It
can read messages sent to V in superstep S - 1, send messages to other vertices that
will be received at superstep S + 1, and modify the state of V and its outgoing edges.
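This superstep contract can be sketched in miniature (a hypothetical Python simulation of the vertex-centric model, not Pregel's actual API) using the classic example of propagating the maximum vertex value: each vertex adopts the largest value among its own and its incoming messages, sends the result to its neighbors when it changes, and otherwise votes to halt.

```python
def pregel_max(adjacency, values, max_supersteps=30):
    """Toy BSP loop: messages sent in superstep S arrive in superstep S + 1."""
    inbox = {v: [] for v in adjacency}
    active = set(adjacency)                    # every vertex starts active
    for superstep in range(max_supersteps):
        if not active:
            break                              # all vertices voted to halt
        outbox = {v: [] for v in adjacency}
        next_active = set()
        for v in adjacency:
            msgs = inbox[v]                    # messages sent in superstep S - 1
            if v not in active and not msgs:
                continue                       # halted and nothing to process
            new_value = max([values[v]] + msgs)
            if superstep == 0 or new_value > values[v]:
                values[v] = new_value          # modify the state of V
                for u in adjacency[v]:         # delivered in superstep S + 1
                    outbox[u].append(new_value)
                next_active.add(v)
            # otherwise the vertex votes to halt
        inbox, active = outbox, next_active
    return values

adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
values = {"a": 3, "b": 6, "c": 2, "d": 1}
print(pregel_max(adjacency, values))  # {'a': 6, 'b': 6, 'c': 6, 'd': 6}
```

The sketch keeps the essential BSP property: all messages produced in one superstep become visible only at the start of the next, so no vertex ever observes a half-updated neighbor within the same iteration.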
Messages are typically sent along outgoing edges, but a message may be sent to any
vertex whose identifier is known (Figure 2.19). Similar to the MapReduce framework,
Pregel has been designed to be an efficient, scalable, and fault-tolerant implementa-
tion on clusters of thousands of commodity computers where the distribution-related
details are hidden behind an abstraction. It keeps vertices and edges on the machine
that performs computation and uses network transfers only for messages. Hence, the