of gigabytes, terabytes, or petabytes. Also, Internet services such as e-commerce
and social networks deal with vast volumes of data generated by millions of users
every day [83]. In terms of resources, cloud datacenters already host tens to hundreds
of thousands of machines (e.g., Amazon EC2 is estimated to host almost half a million
machines [46]), and projections for scaling machine counts even further have
already been set forth.
As pointed out in Section 1.3, upon scaling up the number of machines, programmers/users
expect escalated performance. Specifically, programmers expect an n-fold improvement
in performance from distributed execution of their programs on n nodes versus on a
single node. Unfortunately, this never happens in reality, for several reasons. First,
as shown in Figure 1.13, parts of programs can never be parallelized (e.g., initialization
parts). Second, load imbalance among tasks is highly likely, especially in distributed
systems like clouds. One of the reasons for load imbalance is the heterogeneity of the
cloud, as discussed in the previous section. As depicted in Figure 1.13b, load imbalance
usually delays programs, wherein a program becomes bound to its slowest task. In
particular, even if all but one of a program's tasks finish, the program cannot commit
before that last task finishes (which might greatly linger!). Lastly, other serious
overheads, such as communication and synchronization, can greatly impede scalability.
Such overheads become significant when measuring the speedups obtained by distributed
programs compared with sequential ones. A standard law that allows measuring the
speedups attained by distributed programs, while also accounting for various overheads,
is known as Amdahl's law.
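The effect of load imbalance described above can be sketched numerically. In this illustration the per-task runtimes are arbitrary example values, not measurements from the text:

```python
# Sketch: a program commits only when its slowest task finishes,
# so load imbalance bounds the program by the longest-running task.
task_times = [2.0, 2.1, 1.9, 7.5]  # hypothetical per-task runtimes (seconds)

# With perfect balance, the same total work split across the tasks
# would finish in total_work / num_tasks time units.
ideal_time = sum(task_times) / len(task_times)

# In reality, the program is bound to the slowest task.
actual_time = max(task_times)

print(f"ideal (balanced) completion time:   {ideal_time:.3f} s")
print(f"actual (imbalanced) completion time: {actual_time:.3f} s")
```

Even though three of the four tasks finish quickly, the single straggler more than doubles the completion time relative to a perfectly balanced split.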
For the purpose of describing Amdahl's law, we assume that a sequential version
of a program P takes T_s time units, while a parallel/distributed version takes T_p time
units using a cluster of n nodes. In addition, we suppose that a fraction s of the
program is not parallelizable. Clearly, this leaves a fraction 1 − s of the program
parallelizable. According to Amdahl's law, the speedup of the parallel/distributed
execution of P vs. the sequential one can be defined as follows:
Speedup_p = T_s / T_p = T_s / (T_s × s + T_s × (1 − s)/n) = 1 / (s + (1 − s)/n).
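The formula above can be evaluated directly. The serial fraction of 0.05 used below is an arbitrary example value chosen to show how even a small non-parallelizable part caps scalability:

```python
def amdahl_speedup(s, n):
    """Amdahl's law: speedup of a program with serial fraction s
    run on n nodes, i.e., 1 / (s + (1 - s) / n)."""
    return 1.0 / (s + (1.0 - s) / n)

# With s = 0.05, the speedup approaches 1/s = 20 no matter how
# many nodes are added.
for n in (1, 10, 100, 1000):
    print(f"n = {n:5d}: speedup = {amdahl_speedup(0.05, n):.2f}")
```

Note that as n grows without bound, the (1 − s)/n term vanishes and the speedup converges to 1/s, which is why reducing the serial fraction matters more than adding machines beyond a point.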
FIGURE 1.13
Parallel speedup. (a) Ideal case. (b) Real case. [Figure: four tasks (Task 1 through Task 4), with a legend distinguishing portions that can be parallelized, portions that cannot be parallelized, communication and synchronization overhead, and load imbalance.]