if done injudiciously, might increase communication overhead, impede scalability,
and potentially degrade performance. In fact, with datacenters hosting thousands
of machines, moving data frequently toward distant tasks might become one of the
major bottlenecks. As such, an optimal task scheduler would strike a balance among
system utilization, load balancing, task parallelism, communication overhead, and
scalability so that performance is improved and costs are reduced. Unfortunately, in
practice, this is very hard to accomplish; most task schedulers attempt to optimize
one objective and overlook the others.
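To make the trade-off concrete, the following is a minimal sketch of how a scheduler might weigh two of these objectives, data locality (to reduce communication overhead) against load balance, when placing a task. The Machine and Task records, the score function, and the weights are illustrative assumptions, not a description of any particular engine's scheduler.

```python
from dataclasses import dataclass

@dataclass
class Machine:
    name: str
    load: float         # fraction of task slots in use, 0.0-1.0
    hosted_blocks: set   # IDs of data blocks stored locally

@dataclass
class Task:
    task_id: str
    input_block: str     # ID of the data block the task reads

def score(task: Task, machine: Machine, w_locality=0.7, w_balance=0.3):
    """Higher is better: favor machines that already host the task's input
    (less data movement) but penalize machines that are heavily loaded."""
    locality = 1.0 if task.input_block in machine.hosted_blocks else 0.0
    balance = 1.0 - machine.load
    return w_locality * locality + w_balance * balance

def place(task: Task, machines: list) -> Machine:
    # Greedy placement: pick the machine with the best combined score.
    return max(machines, key=lambda m: score(task, m))
```

Shifting the weights toward locality reduces network traffic but can pile tasks onto a few data-rich machines; shifting them toward balance spreads load but moves more data. This is the tension the text describes.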
Another major challenge when scheduling jobs and tasks is to meet what is known as
service-level objectives (SLOs). SLOs reflect the performance expectations of end-users.
Amazon, Google, and Microsoft have identified SLO violations as a major cause of user
dissatisfaction [31,64,79]. For example, an SLO can be expressed as a maximum latency for
allocating the desired set of resources to a job, a soft/hard deadline to finish a job, or GPU
preferences of some tasks, among others. In multi-tenant heterogeneous clusters, SLOs
are hard to achieve, especially upon the arrival of new jobs while others are executing.
This might require suspending currently running tasks to allow the newly arrived ones to
proceed and, subsequently, meet their specified SLOs. The capability of suspending and
resuming tasks is referred to as task elasticity. Unfortunately, most distributed analytics
engines at the moment, including Hadoop MapReduce, Pregel, and GraphLab, do not
support task elasticity. Making tasks elastic is quite challenging. It demands identifying
safe points where a task can be suspended. A safe point in a task is a point at which the
task can be suspended and later resumed without affecting its correctness and without
repeating all of its committed work (see the sketch following this paragraph). In summary, meeting SLOs, enhancing system utilization,
balancing load, increasing parallelism, reducing communication traffic, and facilitating
scalability are among the objectives that make job and task scheduling one of the major
challenges in developing distributed programs for the cloud.
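As none of the engines named above expose such a mechanism, the following is only a minimal sketch of the safe-point idea: the task checkpoints its progress at regular safe points, and at each one it may yield to the scheduler (for example, to admit a newly arrived SLO-bound job) and later resume without repeating committed work. The checkpoint file and the should_suspend callback are hypothetical names introduced for illustration.

```python
import os
import pickle

CHECKPOINT = "task.ckpt"   # hypothetical per-task checkpoint file

def run_elastic(records, should_suspend):
    """Sum a list of records, suspending only at safe points."""
    # Resume from the last safe point if a checkpoint exists.
    start, partial_sum = 0, 0
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            start, partial_sum = pickle.load(f)

    for i in range(start, len(records)):
        partial_sum += records[i]           # the task's actual work
        if (i + 1) % 100 == 0:              # a safe point every 100 records
            with open(CHECKPOINT, "wb") as f:
                pickle.dump((i + 1, partial_sum), f)
            if should_suspend():            # scheduler asks the task to yield
                return None                 # suspended; committed work is kept
    return partial_sum                      # finished
```

The cost of elasticity is visible even in this toy version: choosing safe points too frequently adds checkpointing overhead, while choosing them too sparsely delays suspension and repeats more work on resume.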
1.7 SUMMARY
This concludes our discussion on distributed programming for the cloud.
As a recap, we commenced our treatment of the topic with a brief background on
the theory of distributed programming. Specifically, we categorized programs into
sequential, parallel, concurrent, and distributed programs and recognized the dif-
ference among processes, threads, tasks, and jobs. Second, we motivated the case
for distributed programming and explained why cloud programs (a special type
of distributed programs) are important for solving complex computing problems.
Third, we defined distributed systems and indicated the relationship between distrib-
uted systems and clouds. Fourth, we delved into the details of the models that cloud
programs can adopt. In particular, we presented the distributed programming (i.e.,
shared memory or message passing), the computation (i.e., synchronous or asynchro-
nous), the parallelism (i.e., graph-parallel or data-parallel), and the architectural (i.e.,
master/slave or peer-to-peer) models in detail. Lastly, we discussed the challenges
with heterogeneity, scalability, communication, synchronization, fault tolerance, and
scheduling, which are encountered when constructing cloud programs.
Throughout our discussion on distributed programming for the cloud, we also
indicated that it is extremely advantageous to relieve programmers from worrying