In the shared-memory programming model, communication is implicit (via shared
arrays and variables) and synchronization is explicit (via locks and barriers). Lastly,
as pointed out earlier, sharing of data has to be offered by the underlying distributed
system. Specifically, the underlying distributed system should provide an illusion
that all memories/disks of the computers in the system form a single shared space
addressable by all tasks. Systems that offer such an underlying shared (virtual)
address space on a cluster of computers (connected by a LAN) are commonly
denoted as distributed shared memory (DSM) systems [44,45,70]. A common
programming language that can be used on DSMs and other distributed shared
systems is OpenMP [55].
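To make the model concrete, the following minimal sketch uses plain Python threads as an analogue (threads stand in for tasks, and ordinary shared variables stand in for a DSM address space; it is not OpenMP, and all names are illustrative). Communication happens implicitly through the shared variable, while synchronization is explicit via a lock and a barrier.

    import threading

    # Shared state: in this model, every task addresses the same variables.
    total = 0
    lock = threading.Lock()           # explicit synchronization: a lock
    barrier = threading.Barrier(4)    # explicit synchronization: a barrier
    chunks = [range(i * 10, (i + 1) * 10) for i in range(4)]

    def task(tid):
        global total
        partial = sum(chunks[tid])    # purely local computation
        with lock:                    # the lock protects the shared accumulator
            total += partial
        barrier.wait()                # wait until every task has contributed
        if tid == 0:                  # any task can now read the shared result
            print("total:", total)

    threads = [threading.Thread(target=task, args=(i,)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()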
Other modern examples that employ a shared-memory view/abstraction are
MapReduce and GraphLab. To summarize, the shared-memory programming model
entails two main criteria: (1) developers need not explicitly encode functions that
send/receive messages in their programs, and (2) the underlying storage layer pro-
vides a shared view to all tasks (i.e., tasks can transparently access any location
in the underlying storage). Clearly, MapReduce satisfies the two criteria. First,
MapReduce developers write only two sequential functions, known as the
map and the reduce functions (i.e., no functions are written or called that explicitly
send and receive messages). In return, MapReduce breaks down the user-defined
map and reduce functions into multiple tasks denoted as map and reduce tasks. All
map tasks are encapsulated in what is known as the map phase, and all reduce tasks
are encompassed in what is called the reduce phase. Subsequently, all communica-
tions occur only between the map and the reduce phases and under the full control
of the engine itself. In addition, any required synchronization is also handled by
the MapReduce engine. For instance, in MapReduce, the user-defined reduce func-
tion cannot be applied before all the map phase output (or intermediate output) is
shuffled, merged, and sorted. Obviously, this requires a barrier between the map and
the reduce phases, which the MapReduce engine internally incorporates. Second,
MapReduce uses the Hadoop Distributed File System (HDFS) [27] as an underly-
ing storage layer. Like any typical distributed file system, HDFS provides a shared
abstraction for all tasks, whereby any task can transparently access any location
in HDFS (i.e., as if accesses are local). Therefore, MapReduce is deemed to offer
a shared-memory abstraction provided internally by Hadoop (i.e., the MapReduce
engine and HDFS).
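To illustrate this division of labor, the toy sketch below mimics the described control flow in a single Python process; it is not Hadoop or its API, and all names (run_mapreduce, map_fn, reduce_fn) are illustrative. The user supplies only the two sequential functions; the "engine" runs the map phase, performs the shuffle/merge/sort step (which acts as the barrier), and only then runs the reduce phase.

    from collections import defaultdict

    # User code: two sequential functions, with no explicit send/receive calls.
    def map_fn(line):
        for word in line.split():
            yield (word, 1)

    def reduce_fn(key, values):
        yield (key, sum(values))

    # "Engine" code: controls all communication between the two phases.
    def run_mapreduce(inputs, map_fn, reduce_fn):
        # Map phase: apply map_fn to every input record.
        intermediate = []
        for record in inputs:
            intermediate.extend(map_fn(record))
        # Barrier: the shuffle/merge/sort step starts only after the whole
        # map phase output exists, so reduce_fn never sees partial output.
        groups = defaultdict(list)
        for key, value in sorted(intermediate):
            groups[key].append(value)
        # Reduce phase: apply reduce_fn to each key and its grouped values.
        results = []
        for key, values in groups.items():
            results.extend(reduce_fn(key, values))
        return results

    print(run_mapreduce(["a rose is a rose", "is a rose"], map_fn, reduce_fn))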
Similar to MapReduce, GraphLab offers a shared-memory abstraction [24,47].
In particular, GraphLab eliminates the need for users to explicitly send/receive
messages in update functions (which represent the user-defined computations in
GraphLab) and provides a shared view among vertices in a graph. To elaborate,
GraphLab allows the scopes of vertices to overlap and lets vertices read from and
write to their scopes. The scope of a vertex v (denoted as S_v) is the data stored
in v and in all of v's adjacent
edges and vertices. Clearly, this poses potential read-write and write-write conflicts
between vertices sharing scopes. The GraphLab engine (and not the users)
synchronizes accesses to shared scopes and ensures consistent parallel execution
by supporting three consistency levels: full consistency, edge consistency, and
vertex consistency. Under full consistency, the update function at each vertex has
an exclusive read-write access to its vertex, adjacent edges, and adjacent vertices.
While this guarantees strong consistency and full correctness, it limits parallelism,
since update functions on vertices with overlapping scopes cannot execute
concurrently.
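The following hypothetical sketch (in plain Python, not GraphLab's actual C++ API; all names are illustrative) shows how an engine in this style could enforce full consistency: before invoking a user-defined update function, it acquires exclusive locks on the entire scope of a vertex, in a global order so that overlapping scopes cannot deadlock.

    import threading

    # A toy graph: per-vertex data plus adjacency lists (edge data omitted).
    data = {"a": 1.0, "b": 2.0, "c": 3.0}
    neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    locks = {v: threading.Lock() for v in data}

    def scope_of(v):
        # S_v: the vertex itself plus all of its adjacent vertices.
        return [v] + neighbors[v]

    def update(v):
        # User-defined computation: reads and writes within its scope only,
        # with no explicit message passing.
        data[v] = (data[v] + sum(data[u] for u in neighbors[v])) / (
            1 + len(neighbors[v]))

    def run_with_full_consistency(v):
        # Full consistency: exclusive locks on the whole scope, acquired in
        # a global (sorted) order so overlapping scopes cannot deadlock.
        scope = sorted(set(scope_of(v)))
        for u in scope:
            locks[u].acquire()
        try:
            update(v)
        finally:
            for u in reversed(scope):
                locks[u].release()

    run_with_full_consistency("b")
    print(data)   # vertex b is now averaged with its neighbors

Because the entire scope is locked, two vertices that share any part of their scopes can never update in parallel, which is precisely the parallelism limitation noted above.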