Distributed Programming for the Cloud - Large Scale and Big Data: Processing and Management

middleware. SOAP defines a scheme for using Extensible Markup Language (XML),

a textual self-describing format, to represent the contents of messages and allow distributed

tasks on diverse machines to interact.
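
To make the message format concrete, the following is a minimal sketch of how a task might wrap a request in a SOAP envelope and how a peer might read it back. It uses only Python's standard library and the SOAP 1.1 envelope namespace. The application namespace, the RunTask element, and the helper names build_request and parse_request are hypothetical and chosen purely for illustration; real middleware would also handle headers, faults, and transport.

```python
import xml.etree.ElementTree as ET

# SOAP 1.1 envelope namespace.
SOAP_ENV_NS = "http://schemas.xmlsoap.org/soap/envelope/"
# Hypothetical application namespace, assumed for this example only.
APP_NS = "http://example.com/tasks"


def build_request(task_id: str, payload: str) -> bytes:
    """Serialize a hypothetical RunTask request inside a SOAP envelope."""
    ET.register_namespace("soap", SOAP_ENV_NS)
    envelope = ET.Element(f"{{{SOAP_ENV_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV_NS}}}Body")
    request = ET.SubElement(body, f"{{{APP_NS}}}RunTask")
    ET.SubElement(request, f"{{{APP_NS}}}TaskId").text = task_id
    ET.SubElement(request, f"{{{APP_NS}}}Payload").text = payload
    return ET.tostring(envelope, encoding="utf-8")


def parse_request(message: bytes) -> dict:
    """Extract the fields of the RunTask request from a SOAP message."""
    root = ET.fromstring(message)
    request = root.find(f"{{{SOAP_ENV_NS}}}Body/{{{APP_NS}}}RunTask")
    # Strip the "{namespace}" prefix from each tag to get plain field names.
    return {child.tag.split("}")[1]: child.text for child in request}


if __name__ == "__main__":
    message = build_request("task-42", "count words in shard 7")
    print(message.decode("utf-8"))
    print(parse_request(message))
```

Because the message is plain, self-describing text, the receiver needs only an XML parser and the agreed element names, regardless of the machine or OS that produced it.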

In general, code suitable for one machine might not be suitable for another

machine on the cloud, especially when instruction set architectures (ISAs) vary

across machines. Ironically, virtualization technology, which induces heterogeneity,

can effectively help solve this problem. Identical VMs can be instantiated

for a user cluster and mapped to physical machines with different underlying ISAs.

Afterward, the virtualization hypervisor will take care of emulating any difference

between the ISAs of the provisioned VMs and the underlying physical machines

(if any). From a user's perspective, all emulations occur transparently. Lastly, users

can always install their own OSs and libraries on system VMs, like Amazon EC2

instances, thus ensuring homogeneity at the OS and library levels.

Another serious problem that requires a great deal of attention from distributed

programmers is performance variation [20,60] on the cloud. Performance variation

means that running the same distributed program on the same cluster twice

can result in substantially different execution times. It has been observed that execution

times can vary by a factor of 5 for the same application on the same private cluster

[60]. Performance variation is mostly caused by the heterogeneity of clouds imposed

by virtualized environments and by the spikes and lulls in resource demand typically experienced

over time. As a consequence, VMs on clouds rarely carry out work at the same

speed, thereby preventing tasks from making progress at (roughly) constant rates.

Clearly, this can create tricky load imbalance and subsequently degrade overall per-

formance. As pointed out earlier, load imbalance makes a program's performance

contingent on its slowest task. Distributed programs can attempt to tackle slow tasks

by detecting them and scheduling corresponding speculative tasks on fast VMs so that

they finish earlier. Specifically, two tasks with the same responsibility can compete

by running on two different VMs, with the one that finishes earlier getting committed

and the other getting killed. For instance, Hadoop MapReduce follows a similar

strategy for solving the same problem, known as speculative execution (see Section

1.5.5). Unfortunately, distinguishing between slow and fast tasks/VMs is very chal-

lenging on the cloud. It could happen that a certain VM running a task is temporar-

ily passing through a demand spike, or it could be the case that the VM is simply

faulty. In theory, not every detectably slow node is faulty, and differentiating between

faulty and slow nodes is hard [71]. Because of that, speculative execution in Hadoop

MapReduce does not perform very well in heterogeneous environments [11,26,73].
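
The competition between two copies of the same task can be sketched as follows. This is an illustrative simulation only, not Hadoop's scheduler: VMs are modeled as threads, the names run_attempt and speculative_run are hypothetical, and each VM's speed is an assumed fixed number of seconds per unit of work. The attempt that finishes first is committed, and the other is asked to stop through a shared cancellation flag.

```python
import threading
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_attempt(vm_name, work_units, seconds_per_unit, cancelled):
    """Simulate one attempt of a task on a VM.

    Returns the VM name if the attempt completes, or None if it was
    cancelled because a competing attempt finished first.
    """
    for _ in range(work_units):
        if cancelled.is_set():
            return None  # the competing attempt won; stop early ("killed")
        time.sleep(seconds_per_unit)  # stand-in for one unit of real work
    return vm_name


def speculative_run(work_units, vm_speeds):
    """Run the same task on every VM in vm_speeds and commit the first finisher.

    vm_speeds maps a (hypothetical) VM name to its seconds per work unit.
    """
    cancelled = threading.Event()
    with ThreadPoolExecutor(max_workers=len(vm_speeds)) as pool:
        futures = [
            pool.submit(run_attempt, vm, work_units, spu, cancelled)
            for vm, spu in vm_speeds.items()
        ]
        # Block until at least one attempt has finished.
        done, _pending = wait(futures, return_when=FIRST_COMPLETED)
        # Tell the remaining attempts to stop at their next check.
        cancelled.set()
        # Attempts in `done` finished before the flag was set, so they hold a VM name.
        winner = next(iter(done)).result()
    return winner


if __name__ == "__main__":
    committed = speculative_run(5, {"slow-vm": 0.05, "fast-vm": 0.005})
    print("committed attempt ran on:", committed)
```

The sketch assumes each VM's speed stays constant, which is exactly what a cloud does not guarantee: a VM that looks slow at launch may only be passing through a temporary demand spike, which is why choosing where and when to launch the speculative copy is the hard part.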

1.6.2 Scalability

The issue of scalability is a dominant subject in distributed computing. A distributed

program is said to be scalable if it remains effective when the quantities of users,

data and resources are increased significantly. To get a sense of the scope of the problem,

consider users first: in cloud computing, most popular applications and platforms

are currently offered as Internet-based services with millions of users. As for data,

in the age of Big Data, or the Era of Tera as Intel calls it [13], distributed programs

typically cope with Web-scale data on the order of hundreds and thousands
