THEOREM 7.7.6 A fully normal algorithm (ascending or descending) for an n-vertex hypercube can be realized in O(log n) steps on a canonical n-vertex cube-connected cycles network.
Thus, a fully normal algorithm on an n-vertex hypercube can be simulated on a CCC network in time proportional to its time on the hypercube. However, the vertices of the CCC have bounded degree, which makes them much easier to realize in hardware than the vertices of high-degree networks.
7.7.7 Fast Matrix Multiplication on the Hypercube
Matrix multiplication can be done more quickly on the hypercube than on a two-dimensional array. Instead of O(n) steps, only O(log n) steps are needed, as we now show.
Consider the multiplication of n × n matrices A and B for n = 2^r to produce the product matrix C = A × B. We describe a normal systolic algorithm to multiply these matrices on a d-dimensional hypercube, d = 3r.
Since d = 3r, the vertices of the d-dimensional hypercube are addressed by a binary 3r-tuple, a = (a_{3r-1}, a_{3r-2}, ..., a_0). Let the r least significant bits of a denote an integer i, let the next r least significant bits denote an integer j, and let the r most significant bits denote an integer k. Then the integer represented by a is kn^2 + jn + i, since n = 2^r. Because of this identity, we represent the address a by the triple (i, j, k). We speak of the processor P_{i,j,k} located at the vertex (i, j, k) of the d-dimensional hypercube, d = 3r. We denote by HC_{i,j,*} the subhypercube in which i and j are fixed, and by HC_{i,*,k} and HC_{*,j,k} the subhypercubes in which the two other pairs of indices are fixed. There are 2^{2r} subhypercubes of each kind.
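This addressing identity is easy to check in code. The sketch below (a small illustration, not part of the original text; the function names are our own) splits a 3r-bit address into the triple (i, j, k) and recombines it as kn^2 + jn + i:

```python
# Illustrative sketch: mapping a 3r-bit hypercube address to the
# triple (i, j, k) and back, where n = 2^r.

def address_to_triple(a: int, r: int) -> tuple:
    """i = r least significant bits, j = next r bits, k = top r bits."""
    mask = (1 << r) - 1
    i = a & mask
    j = (a >> r) & mask
    k = (a >> (2 * r)) & mask
    return i, j, k

def triple_to_address(i: int, j: int, k: int, r: int) -> int:
    """Recombine using the identity a = k*n^2 + j*n + i, n = 2^r."""
    n = 1 << r
    return k * n * n + j * n + i

# Example with r = 2 (n = 4): a = 0b110110 has k = 3, j = 1, i = 2.
r = 2
a = 0b110110
assert address_to_triple(a, r) == (2, 1, 3)
assert triple_to_address(2, 1, 3, r) == a
```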
We assume that each processor P_{i,j,k} contains three local variables, A_{i,j,k}, B_{i,j,k}, and C_{i,j,k}. We also assume that initially A_{i,j,0} = a_{i,j} and B_{i,j,0} = b_{i,j}, where 0 ≤ i, j ≤ n − 1.
The multiplication algorithm has the following five phases:
a) For each subhypercube HC_{i,j,*} and for 1 ≤ k ≤ n − 1, broadcast A_{i,j,0} (containing a_{i,j}) to A_{i,j,k} and B_{i,j,0} (containing b_{i,j}) to B_{i,j,k}.
b) For each subhypercube HC_{i,*,k} and for 0 ≤ j ≤ n − 1, j ≠ k, broadcast A_{i,k,k} (containing a_{i,k}) to A_{i,j,k}.
c) For each subhypercube HC_{*,j,k} and for 0 ≤ i ≤ n − 1, i ≠ k, broadcast B_{k,j,k} (containing b_{k,j}) to B_{i,j,k}.
d) At each processor P_{i,j,k} compute C_{i,j,k} = A_{i,j,k} · B_{i,j,k} = a_{i,k} b_{k,j}.
e) At processor P_{i,j,0} compute the sum C_{i,j,0} = Σ_k C_{i,j,k} (C_{i,j,0} now contains c_{i,j} = Σ_k a_{i,k} b_{k,j}).
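The five phases can be checked with a sequential simulation. The sketch below (our own illustration; the dictionaries A, B, C standing in for the local variables of the n^3 processors are assumptions, not from the text) carries out each phase in turn and verifies the result against a direct matrix product:

```python
# Sketch: sequential simulation of the five-phase hypercube matrix
# multiplication, one dictionary entry per virtual processor P_{i,j,k}.
import random

r = 2
n = 1 << r  # n = 2^r = 4

a = [[random.randint(0, 9) for _ in range(n)] for _ in range(n)]
b = [[random.randint(0, 9) for _ in range(n)] for _ in range(n)]

A, B, C = {}, {}, {}
# Initial placement: A_{i,j,0} = a_{i,j}, B_{i,j,0} = b_{i,j}.
for i in range(n):
    for j in range(n):
        A[i, j, 0] = a[i][j]
        B[i, j, 0] = b[i][j]

# Phase (a): within each HC_{i,j,*}, broadcast along the k-dimension.
for i in range(n):
    for j in range(n):
        for k in range(1, n):
            A[i, j, k] = A[i, j, 0]
            B[i, j, k] = B[i, j, 0]

# Phase (b): within each HC_{i,*,k}, broadcast A_{i,k,k} (holding a_{i,k}).
for i in range(n):
    for k in range(n):
        for j in range(n):
            A[i, j, k] = A[i, k, k]

# Phase (c): within each HC_{*,j,k}, broadcast B_{k,j,k} (holding b_{k,j}).
for j in range(n):
    for k in range(n):
        for i in range(n):
            B[i, j, k] = B[k, j, k]

# Phase (d): local products C_{i,j,k} = a_{i,k} * b_{k,j}.
for i in range(n):
    for j in range(n):
        for k in range(n):
            C[i, j, k] = A[i, j, k] * B[i, j, k]

# Phase (e): sum along the k-dimension into C_{i,j,0}.
c = [[sum(C[i, j, k] for k in range(n)) for j in range(n)] for i in range(n)]

# Sanity check against the direct definition of matrix multiplication.
for i in range(n):
    for j in range(n):
        assert c[i][j] == sum(a[i][k] * b[k][j] for k in range(n))
```

On the hypercube each broadcast and each summation runs along r dimensions of a subhypercube, which is what bounds each phase by O(r) steps; the sequential loops above only check correctness, not timing.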
From Theorem 7.7.2 it follows that each of these five phases can be done in O(r) steps, where r = log2 n. We summarize this result below.
THEOREM 7.7.7 Two n × n matrices, n = 2^r, can be multiplied by a normal systolic algorithm on a d-dimensional hypercube, d = 3r, with n^3 processors in O(log n) steps. All normal algorithms for n × n matrix multiplication require Ω(log n) steps.
Proof The upper bound follows from the construction. The lower bound follows from the
observation that each processor that is participating in the execution of a normal algorithm