if done injudiciously, might increase communication overhead, impede scalability,
and potentially degrade performance. In fact, with datacenters hosting thousands
of machines, moving data frequently toward distant tasks might become one of the
major bottlenecks. As such, an optimal task scheduler would strike a balance among
system utilization, load balancing, task parallelism, communication overhead, and
scalability so that performance is improved and costs are reduced. Unfortunately, in
practice, this is very hard to accomplish; most task schedulers attempt to optimize
one objective and overlook the others.
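To make the trade-off concrete, the following is a minimal sketch of how a scheduler might weigh two of these objectives, data locality (to reduce communication overhead) against load balance, when placing a task. The Machine and Task records, the score function, and the weights are illustrative assumptions, not a description of any particular engine's scheduler.

```python
from dataclasses import dataclass

@dataclass
class Machine:
    name: str
    load: float         # fraction of task slots in use, 0.0-1.0
    hosted_blocks: set   # IDs of data blocks stored locally

@dataclass
class Task:
    task_id: str
    input_block: str     # ID of the data block the task reads

def score(task: Task, machine: Machine, w_locality=0.7, w_balance=0.3):
    """Higher is better: favor machines that already host the task's input
    (less data movement) but penalize machines that are heavily loaded."""
    locality = 1.0 if task.input_block in machine.hosted_blocks else 0.0
    balance = 1.0 - machine.load
    return w_locality * locality + w_balance * balance

def place(task: Task, machines: list) -> Machine:
    # Greedy placement: pick the machine with the best combined score.
    return max(machines, key=lambda m: score(task, m))
```

Shifting the weights toward locality reduces network traffic but can pile tasks onto a few data-rich machines; shifting them toward balance spreads load but moves more data. This is the tension the text describes.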
Another major challenge when scheduling jobs and tasks is to meet what is known as
service-level objectives (SLOs). SLOs reflect the performance expectations of end-users.
Amazon, Google, and Microsoft have identified SLO violations as a major cause of user
dissatisfaction [31,64,79]. For example, an SLO can be expressed as a maximum latency for
allocating the desired set of resources to a job, a soft/hard deadline to finish a job, or GPU
preferences of some tasks, among others. In multi-tenant heterogeneous clusters, SLOs
are hard to achieve, especially upon the arrival of new jobs while others are executing.
This might require suspending currently running tasks to allow the newly arrived ones to
proceed and, subsequently, meet their specified SLOs. The capability of suspending and
resuming tasks is referred to as task elasticity. Unfortunately, most distributed analytics
engines at the moment, including Hadoop MapReduce, Pregel, and GraphLab, do not
support task elasticity. Making tasks elastic is quite challenging. It demands identifying
safe points where a task can be suspended. A safe point in a task is a point at which the
task can be suspended and later resumed without affecting its correctness and without
repeating all of its committed work (see the sketch following this paragraph). In summary, meeting SLOs, enhancing system utilization,
balancing load, increasing parallelism, reducing communication traffic, and facilitating
scalability are among the objectives that make job and task scheduling one of the major
challenges in developing distributed programs for the cloud.
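As none of the engines named above expose such a mechanism, the following is only a minimal sketch of the safe-point idea: the task checkpoints its progress at regular safe points, and at each one it may yield to the scheduler (for example, to admit a newly arrived SLO-bound job) and later resume without repeating committed work. The checkpoint file and the should_suspend callback are hypothetical names introduced for illustration.

```python
import os
import pickle

CHECKPOINT = "task.ckpt"   # hypothetical per-task checkpoint file

def run_elastic(records, should_suspend):
    """Sum a list of records, suspending only at safe points."""
    # Resume from the last safe point if a checkpoint exists.
    start, partial_sum = 0, 0
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            start, partial_sum = pickle.load(f)

    for i in range(start, len(records)):
        partial_sum += records[i]           # the task's actual work
        if (i + 1) % 100 == 0:              # a safe point every 100 records
            with open(CHECKPOINT, "wb") as f:
                pickle.dump((i + 1, partial_sum), f)
            if should_suspend():            # scheduler asks the task to yield
                return None                 # suspended; committed work is kept
    return partial_sum                      # finished
```

The cost of elasticity is visible even in this toy version: choosing safe points too frequently adds checkpointing overhead, while choosing them too sparsely delays suspension and repeats more work on resume.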
1.7 SUMMARY
This concludes our discussion on distributed programming for the cloud.
As a recap, we commenced our treatment of the topic with a brief background on
the theory of distributed programming. Specifically, we categorized programs into
sequential, parallel, concurrent, and distributed programs and recognized the dif-
ference among processes, threads, tasks, and jobs. Second, we motivated the case
for distributed programming and explained why cloud programs (a special type
of distributed programs) are important for solving complex computing problems.
Third, we defined distributed systems and indicated the relationship between distrib-
uted systems and clouds. Fourth, we delved into the details of the models that cloud
programs can adopt. In particular, we presented the distributed programming (i.e.,
shared memory or message passing), the computation (i.e., synchronous or asynchro-
nous), the parallelism (i.e., graph-parallel or data-parallel), and the architectural (i.e.,
master/slave or peer-to-peer) models in detail. Lastly, we discussed the challenges
with heterogeneity, scalability, communication, synchronization, fault tolerance, and
scheduling, which are encountered when constructing cloud programs.
Throughout our discussion on distributed programming for the cloud, we also
indicated that it is extremely advantageous to relieve programmers from worrying