Biomedical Engineering Reference
In-Depth Information
TABLE 7.3 Scan Times (in seconds) of SwissProt (release
40, 113997 Protein Sequences) for Various HMM Lengths
with the Viterbi algorithm on Systola 1024 and a PC
Cluster with 16 Systolas
HMM length
112
222
490
Systola 1024
152
288
546
Speedup
3
3.5
4
PC cluster of 16 Systolas
12
22
40
Speedup
37
45
56
Note : The speedup compared to a Pentium III 1 GHz is also reported.
Systola 1024 is around two times faster than the much larger 1K-PE MasPar and the
cluster of 16 Systolas is around two times faster than a 16K-PE MasPar. The one-
board Kestrel is four to five times faster than a Systola board. Kestrel's design [8]
is also a programmable fine-grained parallel architecture implemented as a PC add-
on board. It reaches the higher performance, because it has been built with 0.5
m
µ
CMOS technology, in comparison with 1.0
m for Systola 1024. Extrapolating to
this technology both approaches should perform equally. However, the difference
between both architectures is that Kestrel is purely a linear array, whereas Systola is a
mesh. This makes the Systola 1024 a more flexible design, suitable for a wider range
of applications; see, for example, [27-32].
µ
7.4.1 Performance Evaluation of a Reconfigurable Platform
To evaluate our approach on a device based on state-of-the-art technology, we have
implemented the SW algorithm on a modern reconfigurable architecture. Figure 7.9
shows our PE design for computing one cell of the SW dynamic programming matrix.
A linear systolic array of this PE can be built to compute the SW algorithm as shown
in Figure 7.7.
As the query sequence is usually larger than the processor array, we have to do a
partitioning on a fixed-size linear processor array. For sake of clarity, we first assume
a query sequence of length M and a processor array of size N , where M is a multiple
of N , that is, M
1 is an integer. A possible solution is to split the
computation into k passes. Unfortunately, this solution requires a large amount of off-
chip memory. The memory requirement can be reduced by factor p by splitting the
database into p equal-sized pieces and computing the alignment scores of all subject
sequences within each piece.
However, this approach also increases the loading time of substitution table
columns by factor p . To eliminate this loading time, we have slightly extended our PE
design. Each PE now stores k columns of the substitution table instead of only one.
Although this increases the area of each PE, it allows for alignment of each database
sequence with the complete query sequence without additional delays. It also reduces
the required off-chip memory for storing intermediate results to four times the longest
=
kN where k
 
Search WWH ::




Custom Search