language [28]. In the same way, other hardware accelerators like field programmable gate arrays (FPGAs) have been considered to speed up various parallel applications [29], including multiscale simulations [30].
Current GP-GPUs [31] allow the execution of hundreds of threads on a regular PC hosting a device card. This capability can be exploited in life science applications when the same algorithm has to be computed many times. Since the introduction of Tesla boards by Nvidia, single-precision performance has shown a very significant improvement even when compared with the latest CPU processors. The interconnection of GP-GPU boards and servers is used to build clusters [32], and nowadays grids of hybrid machines are even on track and used for compute-intensive bioinformatics applications [33]. Different brands of GP-GPU exist, and ATI also proposes very interesting cards. In the case of the widely deployed Nvidia Tesla 10, the board provides 240 vector cores split into 30 streaming multiprocessors (SMs) with eight thread processors (SPs) each. Each streaming multiprocessor can run a set of 32 threads (a warp) with the same control flow on different data in one GPU cycle (each SP computes four identical operations per GPU cycle); and since an SM can schedule up to 32 warps at a time, this leads to potentially 1024 threads running concurrently on each of the 30 SMs. The GP-GPU programming environment is provided by the manufacturer, for instance CUDA (Compute Unified Device Architecture*) in the case of Nvidia, or by a portable programming standard usable across manufacturers (OpenCL†).
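As a rough illustration of this execution model (the kernel, array size, and block size below are illustrative choices, not taken from the cited applications), a CUDA program launches many blocks of threads that all run the same code, each thread handling one data element:

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: every thread applies the same operation
// (here a scaling) to a different element of the array.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard for the partial last block
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;                          // one million elements (arbitrary)
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Threads are grouped in blocks; the hardware schedules them as
    // 32-thread warps on the streaming multiprocessors.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```

Each block is mapped to a single streaming multiprocessor and its threads are executed as 32-thread warps, which is why block sizes are usually chosen as multiples of 32.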
However, programming GP-GPUs can be tricky. The main difficulty lies in memory manipulation, since GP-GPUs have several levels of memory with different performances. The GP-GPU global memory, which can be accessed by any thread at any time, has a very high access latency. The shared memory available in each streaming multiprocessor inside a GP-GPU does not suffer from such latency, but this memory is only shared by the threads running on the same multiprocessor. In addition, the number of concurrent threads in a streaming multiprocessor is limited. Moreover, memory transfers between the host computer and the GP-GPU device can severely degrade the overall speedup if the computation time is not significant enough in comparison with the data transfer time. With CUDA, all the threads needed to execute a kernel must be grouped in blocks, and all these blocks must have the same, limited number of threads. All the threads of a block are executed on the same multiprocessor and can therefore make use of its shared memory. To get the best results from GP-GPUs, we have to place data in shared memory, as sketched below; this is the fastest memory managed by a streaming multiprocessor. The latency for accessing global memory is very high, and we have to limit its accesses. Thus,
* What is CUDA? See the CUDA website: http://www.nvidia.co.uk/object/cuda_what_is_uk.html.
Accessed January 20, 2010.
† OpenCL Overview. See the OpenCL website from Khronos: http://www.khronos.org/opencl/.
Accessed January 20, 2010.
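To make the memory hierarchy and the transfer cost concrete, the following hypothetical sketch (a simple block-wise sum, not code from the cited work) has each block stage a tile of global memory in its shared memory, while the two host-device copies around the kernel launch are the transfers whose cost must be amortized by the computation:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 256

// Each block loads a tile of global memory into its fast shared memory,
// then reduces the tile cooperatively; only the per-block partial sums
// go back to the slow global memory.
__global__ void blockSum(const float *in, float *partial, int n)
{
    __shared__ float tile[BLOCK];               // per-multiprocessor shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f; // one global read per thread
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = tile[0];          // one global write per block
}

int main()
{
    const int n = 1 << 20;
    const int blocks = (n + BLOCK - 1) / BLOCK;

    float *h_in = new float[n];
    float *h_partial = new float[blocks];
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    float *d_in, *d_partial;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));

    // Host-to-device transfer: this copy (and the one back) is the overhead
    // that must stay small relative to the kernel's computation time.
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, BLOCK>>>(d_in, d_partial, n);

    cudaMemcpy(h_partial, d_partial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);

    double total = 0.0;
    for (int b = 0; b < blocks; ++b) total += h_partial[b];
    printf("sum = %.0f\n", total);              // expect 1048576

    cudaFree(d_in); cudaFree(d_partial);
    delete[] h_in; delete[] h_partial;
    return 0;
}
```

In this sketch each thread touches global memory once on input and each block once on output; all intermediate accesses hit the multiprocessor's shared memory, and the two cudaMemcpy calls are the host-device transfers that must remain small relative to the kernel's work.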