SIMD Threads in the same Thread Block can communicate via Local Memory. (The maximum number of SIMD Threads that can execute simultaneously per Thread Block is 16 for Tesla-generation GPUs and 32 for the later Fermi-generation GPUs.)
The Thread Block Scheduler assigns each Thread Block to a processor that executes its code, which we call a multithreaded SIMD Processor. The Thread Block Scheduler has some similarities to a control processor in a vector architecture: it determines the number of Thread Blocks needed for the loop and keeps allocating them to different multithreaded SIMD Processors until the loop is completed. In this example, it would send 16 Thread Blocks to multithreaded SIMD Processors to compute all 8192 elements of this loop.
Figure 4.14 shows a simplified block diagram of a multithreaded SIMD Processor. It is similar to a Vector Processor, but where a Vector Processor has a few deeply pipelined functional units, it has many parallel functional units. In the programming example in Figure 4.13, each multithreaded SIMD Processor is assigned 512 elements of the vectors to work on. SIMD Processors are full processors with separate PCs and are programmed using threads (see Chapter 3).