Data-Level Parallelism in Vector, SIMD, and GPU Architectures - Computer Architecture: A Quantitative Approach

Exercises

4.9 [10/20/20/15/15] <4.2> Consider the following code, which multiplies two vectors that contain single-precision complex values:

for (i=0;i<300;i++) {
  c_re[i] = a_re[i] * b_re[i] - a_im[i] * b_im[i];
  c_im[i] = a_re[i] * b_im[i] + a_im[i] * b_re[i];
}

Assume that the processor runs at 700 MHz and has a maximum vector length of 64. The load/store unit has a start-up overhead of 15 cycles; the multiply unit, 8 cycles; and the add/subtract unit, 5 cycles.

a. [10] <4.2> What is the arithmetic intensity of this kernel? Justify your answer.

b. [20] <4.2> Convert this loop into VMIPS assembly code using strip mining.

c. [20] <4.2> Assuming chaining and a single memory pipeline, how many chimes are required? How many clock cycles are required per complex result value, including start-up overhead?

d. [15] <4.2> If the vector sequence is chained, how many clock cycles are required per complex result value, including overhead?

e. [15] <4.2> Now assume that the processor has three memory pipelines and chaining. If there are no bank conflicts in the loop's accesses, how many clock cycles are required per result?

4.10 [30] <4.4> In this problem, we will compare the performance of a vector processor with a hybrid system that contains a scalar processor and a GPU-based coprocessor. In the hybrid system, the host processor has superior scalar performance to the GPU, so in this case all scalar code is executed on the host processor while all vector code is executed on the GPU. We will refer to the first system as the vector computer and the second system as the hybrid computer. Assume that your target application contains a vector kernel with an arithmetic intensity of 0.5 FLOPs per DRAM byte accessed; however, the application also has a scalar component that must be performed before and after the kernel in order to prepare the input vectors and output vectors, respectively. For a sample dataset, the scalar portion of the code requires 400 ms of execution time on both the vector processor and the host processor in the hybrid system. The kernel reads input vectors consisting of 200 MB of data and has output data consisting of 100 MB of data. The vector processor has a peak memory bandwidth of 30 GB/sec and the GPU has a peak memory bandwidth of 150 GB/sec. The hybrid system has an additional overhead that requires all input vectors to be transferred between the host memory and GPU local memory before and after the kernel is invoked. The hybrid system has a direct memory access (DMA) bandwidth of 10 GB/sec and an average latency of 10 ms. Assume that both the vector processor and GPU are performance bound by memory bandwidth. Compute the execution time required by both computers for this application.

4.11 [15/25/25] <4.4, 4.5> Section 4.5 discussed the reduction operation that reduces a vector down to a scalar by repeated application of an operation. A reduction is a special type of loop recurrence. An example is shown below:

dot=0.0;
for (i=0;i<64;i++) dot = dot + a[i] * b[i];

A vectorizing compiler might apply a transformation called scalar expansion, which expands dot into a vector and splits the loop such that the multiply can be performed with a vector operation, leaving the reduction as a separate scalar operation:
