language [28]. In the same way, other hardware accelerators like field programmable gate arrays (FPGAs) have been considered to speed up various parallel applications [29], including multiscale simulations [30].
Current GP-GPUs [31] allow the execution of hundreds of threads on a regular PC hosting a device card. This capability can be exploited in life science applications when the same algorithm has to be computed many times. Since the introduction of Tesla boards by Nvidia, single-precision performance has shown a very significant improvement even when compared with the latest CPU processors. The interconnection of GP-GPU boards and servers is used to build clusters [32], and nowadays grids of hybrid machines are even on track and used for compute-intensive bioinformatics applications [33]. Different brands of GP-GPU exist, and ATI also proposes very interesting cards. In the case of the widely deployed Nvidia Tesla 10, the board provides 240 vector cores split into 30 streaming multiprocessors (SMs) with eight thread processors (SPs) each. Each streaming multiprocessor can run a set of 32 threads (a warp) with the same control flow on different data in one GPU cycle (each SP computes four identical operations per GPU cycle); and since an SM can schedule up to 32 warps at a time, this leads to potentially 1024 threads running concurrently on each of the 30 SMs. The GP-GPU programming environment is provided by the manufacturer, for instance CUDA (Compute Unified Device Architecture*) in the case of Nvidia, or by a portable programming standard usable across manufacturers (OpenCL†).
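As a rough illustration of this execution model (the kernel, array size, and block size below are illustrative choices, not taken from the cited applications), a CUDA program launches many blocks of threads that all run the same code, each thread handling one data element:

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: every thread applies the same operation
// (here a scaling) to a different element of the array.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard for the partial last block
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;                          // one million elements (arbitrary)
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Threads are grouped in blocks; the hardware schedules them as
    // 32-thread warps on the streaming multiprocessors.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```

Each block is mapped to a single streaming multiprocessor and its threads are executed as 32-thread warps, which is why block sizes are usually chosen as multiples of 32.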
However, programming GP-GPUs can be tricky. The main difficulty lies in memory manipulation, since GP-GPUs have several levels of memory with different performances. The GP-GPU global memory, which can be accessed by any thread at any time, has a very high access latency. The shared memory available in each streaming multiprocessor inside a GP-GPU does not suffer from such latency, but this memory is only shared by the threads running on the same multiprocessor. In addition, the number of concurrent threads in a streaming multiprocessor is limited. Moreover, memory transfers between the host computer and the GP-GPU device can severely degrade the overall speedup if the computation time is not significant enough in comparison with the data transfer time. With CUDA, all the threads needed to execute a kernel must be grouped in blocks, and all these blocks must have the same, limited number of threads. All the threads of a block are executed on the same multiprocessor and can therefore make use of its shared memory. To get the best results from GP-GPUs, we have to place data in shared memory, as sketched below; this is the fastest memory managed by a streaming multiprocessor. The latency for accessing global memory is very high, and we have to limit its accesses. Thus,
* What is CUDA? See the CUDA website: http://www.nvidia.co.uk/object/cuda_what_is_uk.html.
Accessed January 20, 2010.
† OpenCL Overview. See the OpenCL website from Khronos: http://www.khronos.org/opencl/.
Accessed January 20, 2010.
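To make the memory hierarchy and the transfer cost concrete, the following hypothetical sketch (a simple block-wise sum, not code from the cited work) has each block stage a tile of global memory in its shared memory, while the two host-device copies around the kernel launch are the transfers whose cost must be amortized by the computation:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 256

// Each block loads a tile of global memory into its fast shared memory,
// then reduces the tile cooperatively; only the per-block partial sums
// go back to the slow global memory.
__global__ void blockSum(const float *in, float *partial, int n)
{
    __shared__ float tile[BLOCK];               // per-multiprocessor shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f; // one global read per thread
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = tile[0];          // one global write per block
}

int main()
{
    const int n = 1 << 20;
    const int blocks = (n + BLOCK - 1) / BLOCK;

    float *h_in = new float[n];
    float *h_partial = new float[blocks];
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    float *d_in, *d_partial;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));

    // Host-to-device transfer: this copy (and the one back) is the overhead
    // that must stay small relative to the kernel's computation time.
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, BLOCK>>>(d_in, d_partial, n);

    cudaMemcpy(h_partial, d_partial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);

    double total = 0.0;
    for (int b = 0; b < blocks; ++b) total += h_partial[b];
    printf("sum = %.0f\n", total);              // expect 1048576

    cudaFree(d_in); cudaFree(d_partial);
    delete[] h_in; delete[] h_partial;
    return 0;
}
```

In this sketch each thread touches global memory once on input and each block once on output; all intermediate accesses hit the multiprocessor's shared memory, and the two cudaMemcpy calls are the host-device transfers that must remain small relative to the kernel's work.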