In the shared-memory programming model, communication is implicit (via shared
arrays and variables) and synchronization is explicit (via locks and barriers). Lastly,
as pointed out earlier, sharing of data has to be offered by the underlying distributed
system. Specifically, the underlying distributed system should provide an illusion
that all memories/disks of the computers in the system form a single shared space
addressable by all tasks. Systems that offer such an underlying shared (virtual)
address space on a cluster of computers (connected by a LAN) are commonly
denoted as distributed shared memory (DSM) systems [44,45,70]. A common
programming language that can be used on DSMs and other distributed shared
systems is OpenMP [55].
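To make the model concrete, the following minimal sketch uses plain Python threads as an analogue (threads stand in for tasks, and ordinary shared variables stand in for a DSM address space; it is not OpenMP, and all names are illustrative). Communication happens implicitly through the shared variable, while synchronization is explicit via a lock and a barrier.

    import threading

    # Shared state: in this model, every task addresses the same variables.
    total = 0
    lock = threading.Lock()           # explicit synchronization: a lock
    barrier = threading.Barrier(4)    # explicit synchronization: a barrier
    chunks = [range(i * 10, (i + 1) * 10) for i in range(4)]

    def task(tid):
        global total
        partial = sum(chunks[tid])    # purely local computation
        with lock:                    # the lock protects the shared accumulator
            total += partial
        barrier.wait()                # wait until every task has contributed
        if tid == 0:                  # any task can now read the shared result
            print("total:", total)

    threads = [threading.Thread(target=task, args=(i,)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()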
Other modern examples that employ a shared-memory view/abstraction are
MapReduce and GraphLab. To summarize, the shared-memory programming model
entails two main criteria: (1) developers need not explicitly encode functions that
send/receive messages in their programs, and (2) the underlying storage layer pro-
vides a shared view to all tasks (i.e., tasks can transparently access any location
in the underlying storage). Clearly, MapReduce satisfies the two criteria. First,
MapReduce developers write only two sequential functions, known as the
map and the reduce functions (i.e., no functions are written or called that explicitly
send and receive messages). In return, MapReduce breaks down the user-defined
map and reduce functions into multiple tasks denoted as map and reduce tasks. All
map tasks are encapsulated in what is known as the map phase, and all reduce tasks
are encompassed in what is called the reduce phase. Subsequently, all communica-
tions occur only between the map and the reduce phases and under the full control
of the engine itself. In addition, any required synchronization is also handled by
the MapReduce engine. For instance, in MapReduce, the user-defined reduce func-
tion cannot be applied before all the map phase output (or intermediate output) is
shuffled, merged, and sorted. Obviously, this requires a barrier between the map and
the reduce phases, which the MapReduce engine internally incorporates. Second,
MapReduce uses the Hadoop Distributed File System (HDFS) [27] as an underly-
ing storage layer. Like any typical distributed file system, HDFS provides a shared
abstraction for all tasks, whereby any task can transparently access any location
in HDFS (i.e., as if accesses are local). Therefore, MapReduce is deemed to offer
a shared-memory abstraction provided internally by Hadoop (i.e., the MapReduce
engine and HDFS).
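To illustrate this division of labor, the toy sketch below mimics the described control flow in a single Python process; it is not Hadoop or its API, and all names (run_mapreduce, map_fn, reduce_fn) are illustrative. The user supplies only the two sequential functions; the "engine" runs the map phase, performs the shuffle/merge/sort step (which acts as the barrier), and only then runs the reduce phase.

    from collections import defaultdict

    # User code: two sequential functions, with no explicit send/receive calls.
    def map_fn(line):
        for word in line.split():
            yield (word, 1)

    def reduce_fn(key, values):
        yield (key, sum(values))

    # "Engine" code: controls all communication between the two phases.
    def run_mapreduce(inputs, map_fn, reduce_fn):
        # Map phase: apply map_fn to every input record.
        intermediate = []
        for record in inputs:
            intermediate.extend(map_fn(record))
        # Barrier: the shuffle/merge/sort step starts only after the whole
        # map phase output exists, so reduce_fn never sees partial output.
        groups = defaultdict(list)
        for key, value in sorted(intermediate):
            groups[key].append(value)
        # Reduce phase: apply reduce_fn to each key and its grouped values.
        results = []
        for key, values in groups.items():
            results.extend(reduce_fn(key, values))
        return results

    print(run_mapreduce(["a rose is a rose", "is a rose"], map_fn, reduce_fn))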
Similar to MapReduce, GraphLab offers a shared-memory abstraction [24,47].
In particular, GraphLab eliminates the need for users to explicitly send/receive
messages in update functions (which represent the user-defined computations in
GraphLab) and provides a shared view among vertices in a graph. To elaborate,
GraphLab allows the scopes of vertices to overlap and lets vertices read from and
write to their scopes. The scope of a vertex v (denoted as S_v) is the data stored
in v and in all of v's adjacent
edges and vertices. Clearly, this poses potential read-write and write-write conflicts
between vertices sharing scopes. The GraphLab engine (and not the users)
synchronizes accesses to shared scopes and ensures consistent parallel execution
by supporting three consistency levels: full consistency, edge consistency, and
vertex consistency. Under full consistency, the update function at each vertex has
an exclusive read-write access to its vertex, adjacent edges, and adjacent vertices.
While this guarantees strong consistency and full correctness, it limits parallelism,
since update functions on vertices with overlapping scopes cannot execute
concurrently.
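The following hypothetical sketch (in plain Python, not GraphLab's actual C++ API; all names are illustrative) shows how an engine in this style could enforce full consistency: before invoking a user-defined update function, it acquires exclusive locks on the entire scope of a vertex, in a global order so that overlapping scopes cannot deadlock.

    import threading

    # A toy graph: per-vertex data plus adjacency lists (edge data omitted).
    data = {"a": 1.0, "b": 2.0, "c": 3.0}
    neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    locks = {v: threading.Lock() for v in data}

    def scope_of(v):
        # S_v: the vertex itself plus all of its adjacent vertices.
        return [v] + neighbors[v]

    def update(v):
        # User-defined computation: reads and writes within its scope only,
        # with no explicit message passing.
        data[v] = (data[v] + sum(data[u] for u in neighbors[v])) / (
            1 + len(neighbors[v]))

    def run_with_full_consistency(v):
        # Full consistency: exclusive locks on the whole scope, acquired in
        # a global (sorted) order so overlapping scopes cannot deadlock.
        scope = sorted(set(scope_of(v)))
        for u in scope:
            locks[u].acquire()
        try:
            update(v)
        finally:
            for u in reversed(scope):
                locks[u].release()

    run_with_full_consistency("b")
    print(data)   # vertex b is now averaged with its neighbors

Because the entire scope is locked, two vertices that share any part of their scopes can never update in parallel, which is precisely the parallelism limitation noted above.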