over distributed systems (e.g., the cloud). A distributed programming model defines
how easily and efficiently algorithms can be specified as distributed programs.
For instance, a distributed programming model that highly abstracts architectural/
hardware details, automatically parallelizes and distributes computation, and trans-
parently supports fault tolerance is deemed an easy-to-use programming model. The
efficiency of the model, however, depends on the effectiveness of the techniques that
underlie the model. There are two classical distributed programming models that are
in wide use: shared memory and message passing. The two models fulfill different
needs and suit different circumstances. Nonetheless, they are elementary in the sense
that they only provide a basic interaction model for distributed tasks and lack any
facility to automatically parallelize and distribute tasks or tolerate faults. Recently,
there have been other advanced models that address the inefficiencies and challenges
posed by the shared-memory and the message-passing models, especially upon port-
ing them to the cloud. Among these models are MapReduce [17], Pregel [49], and
GraphLab [47]. These models are built upon the shared-memory and the message-
passing programming paradigms, yet are more involved and offer various properties
that are essential for the cloud. As these models differ considerably from the traditional
ones, we refer to them as distributed analytics engines.
1.5.2.1 The Shared-Memory Programming Model
In the shared-memory programming model, tasks can communicate by reading
and writing to shared memory (or disk) locations. Thus, the abstraction provided
by the shared-memory model is that tasks can access any location in the distributed
memories/disks. This is similar to threads of a single process in operating systems,
whereby all threads share the process address space and communicate by reading
and writing to that space (see Figure 1.4). Therefore, with shared memory, data is not
explicitly communicated but implicitly exchanged via sharing. Because of this sharing, the
shared-memory programming model entails the use of synchronization mechanisms
within distributed programs. Synchronization is needed to control the order
in which read/write operations are performed by various tasks. In particular, what
is required is that distributed tasks are prevented from simultaneously writing to a
shared data item, so as to avoid corrupting the data or making it inconsistent. This is
typically achieved using semaphores, locks, and/or barriers. A semaphore is
a point-to-point synchronization mechanism that involves two parallel/distributed
FIGURE 1.4 Tasks running in parallel and sharing an address space through which they
can communicate. (The figure shows tasks T1-T4 spawned at point S1 and joined back at
point S2 over a shared address space.)
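The spawn/join structure of Figure 1.4 and the need for synchronization can be sketched with operating-system threads, the single-process analogy the text itself draws. The following minimal Python sketch (all names are illustrative, not from the source) spawns four tasks that increment a shared counter; the lock ensures that no two tasks write the shared location simultaneously, which could otherwise lose updates:

```python
# Minimal sketch of the shared-memory model using threads (hypothetical
# example): four tasks are spawned, update a shared location, and join,
# mirroring the spawn/join structure of Figure 1.4.
import threading

counter = 0                   # shared "memory" location
lock = threading.Lock()       # synchronization mechanism (a lock)

def task(n_increments):
    global counter
    for _ in range(n_increments):
        with lock:            # only one task may write at a time
            counter += 1      # read-modify-write on shared data

# Spawn four parallel tasks (point S1 in Figure 1.4).
threads = [threading.Thread(target=task, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()

# Join: wait for all tasks to finish (point S2 in Figure 1.4).
for t in threads:
    t.join()

print(counter)                # 40000: no updates were lost
```

Note that the tasks never send data to one another explicitly; they communicate only by reading and writing the shared variable, which is exactly the implicit exchange via sharing described above.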