Data-Level Parallelism in Vector, SIMD, and GPU Architectures - Computer Architecture: A Quantitative Approach

FIGURE 4.8 Summary of typical SIMD multimedia support for 256-bit-wide operations. Note that the IEEE 754-2008 floating-point standard added half-precision (16-bit) and quad-precision (128-bit) floating-point operations.

In contrast to vector architectures, which offer an elegant instruction set that is intended to be the target of a vectorizing compiler, SIMD extensions have three major omissions:

■ Multimedia SIMD extensions fix the number of data operands in the opcode, which has led to the addition of hundreds of instructions in the MMX, SSE, and AVX extensions of the x86 architecture. Vector architectures have a vector length register that specifies the number of operands for the current operation. These variable-length vector registers easily accommodate programs that naturally have shorter vectors than the maximum size the architecture supports. Moreover, vector architectures have an implicit maximum vector length in the architecture, which combined with the vector length register avoids the use of many opcodes.

■ Multimedia SIMD does not offer the more sophisticated addressing modes of vector architectures, namely strided accesses and gather-scatter accesses. These features increase the number of programs that a vector compiler can successfully vectorize (see Section 4.7).

■ Multimedia SIMD usually does not offer the mask registers to support conditional execution of elements as in vector architectures.
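
To make the last omission concrete, the sketch below models in portable C how a conditional update such as "if x[i] is nonzero, subtract y[i]" can be expressed without mask registers: the result is computed for every lane, a comparison produces an all-ones or all-zeros bit pattern per lane, and bitwise operations select between the new and old values. This is an illustrative model only, not actual SIMD code; the function name masked_subtract and the choice of 32-bit lanes are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: for each lane, x[i] = x[i] - y[i] only where x[i] != 0.
 * The subtraction is performed for every lane; the mask decides which
 * result is kept, which is how branch-free lane selection works. */
void masked_subtract(int32_t *x, const int32_t *y, size_t n)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; i++) {
        /* All ones where the condition holds, all zeros elsewhere. */
        uint32_t mask = (x[i] != 0) ? UINT32_MAX : 0u;
        uint32_t old  = (uint32_t)x[i];
        /* Unsigned arithmetic wraps, so this is well defined for all inputs. */
        uint32_t diff = old - (uint32_t)y[i];
        /* Keep diff where mask is set, keep the old value elsewhere. */
        x[i] = (int32_t)((diff & mask) | (old & ~mask));
    }
}
```

Calling masked_subtract on x = {5, 0, -3, 7} and y = {1, 9, 2, 7} leaves the zero lane untouched while the other three lanes are updated. The cost is that work is done for every lane whether or not its result is kept.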

These omissions make it harder for the compiler to generate SIMD code and increase the difficulty of programming in SIMD assembly language.
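
One consequence of fixing the operand count in the opcode can be sketched as follows. Assuming a width of four 32-bit elements per operation (the LANES constant below is an assumption, as is the function name saxpy_fixed_width), a loop over n elements has to be split into a main loop that consumes full groups of four and a scalar cleanup loop for the leftovers. With a vector length register, the last partial group could instead be handled by the same vector instructions with a shorter length. The inner loop here only stands in for a single SIMD operation; it is plain C, not intrinsics.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4  /* assumed fixed width: four 32-bit floats per operation */

/* Computes y[i] = a * x[i] + y[i] for i in [0, n).
 * Returns how many elements fell through to the scalar cleanup loop. */
size_t saxpy_fixed_width(float a, const float *x, float *y, size_t n)
{
    assert(n == 0 || (x != NULL && y != NULL));
    size_t i = 0;

    /* Main loop: each outer iteration models one fixed-width operation. */
    for (; i + LANES <= n; i += LANES) {
        for (size_t lane = 0; lane < LANES; lane++)
            y[i + lane] = a * x[i + lane] + y[i + lane];
    }

    /* Cleanup loop: the remaining n % LANES elements, one at a time. */
    size_t tail = n - i;
    for (; i < n; i++)
        y[i] = a * x[i] + y[i];

    return tail;
}
```

For n = 10 the main loop covers eight elements and the cleanup loop handles the remaining two.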

For the x86 architecture, the MMX instructions added in 1996 repurposed the 64-bit floating-point registers, so the basic instructions could perform eight 8-bit operations or four 16-bit operations simultaneously. These were joined by parallel MAX and MIN operations, a wide variety of masking and conditional instructions, operations typically found in digital signal processors, and ad hoc instructions that were believed to be useful in important media libraries. Note that MMX reused the floating-point data transfer instructions to access memory.
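
The idea of eight 8-bit operations in one 64-bit register can be illustrated in software. The sketch below is not MMX code; it emulates a wrapping packed 8-bit add on an ordinary 64-bit integer by preventing carries from crossing lane boundaries, which is what the hardware does by partitioning its adder. The names packed_add_u8 and lane_u8 are hypothetical, and lane 0 is taken to be the least significant byte.

```c
#include <assert.h>
#include <stdint.h>

/* Add eight unsigned 8-bit lanes packed into 64-bit words.
 * Each lane wraps modulo 256 and never carries into its neighbor. */
uint64_t packed_add_u8(uint64_t a, uint64_t b)
{
    const uint64_t HIGH = 0x8080808080808080ULL; /* top bit of every lane */
    /* Add only the low 7 bits of each lane; the carry lands in bit 7
     * of the same lane and cannot escape it. */
    uint64_t low = (a & ~HIGH) + (b & ~HIGH);
    /* Bit 7 of each lane is a7 XOR b7 XOR (carry into bit 7). */
    return low ^ ((a ^ b) & HIGH);
}

/* Extract lane i (0 = least significant byte) from a packed word. */
uint8_t lane_u8(uint64_t v, unsigned i)
{
    assert(i < 8);
    return (uint8_t)(v >> (8u * i));
}
```

Adding 1 to every lane of 0x00FF01807F102030 gives 0x0100028180112131: the 0xFF lane wraps to 0x00 without disturbing the lane above it. A four-lane 16-bit variant would follow the same pattern with a mask marking the top bit of each 16-bit lane.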

The Streaming SIMD Extensions (SSE) successor in 1999 added separate registers that were 128 bits wide, so now instructions could simultaneously perform sixteen 8-bit operations, eight 16-bit operations, or four 32-bit operations. It also performed parallel single-precision floating-point arithmetic. Since SSE had separate registers, it needed separate data transfer instructions. Intel soon added double-precision SIMD floating-point data types via SSE2 in 2001, SSE3 in 2004, and SSE4 in 2007. Instructions with four single-precision floating-point operations or two parallel double-precision operations increased the peak floating-point performance of the x86 computers, as long as programmers place the operands side by side. With each generation, they also added ad hoc instructions whose aim is to accelerate specific multimedia functions perceived to be important.
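
The operation counts quoted for MMX and SSE follow directly from dividing the register width by the element width. A minimal sketch of that arithmetic, with lane_count as a hypothetical helper name:

```c
#include <assert.h>

/* Number of elements that fit side by side in one SIMD register.
 * Assumes the element width divides the register width evenly. */
unsigned lane_count(unsigned register_bits, unsigned element_bits)
{
    assert(element_bits != 0 && register_bits % element_bits == 0);
    return register_bits / element_bits;
}
```

For the 64-bit MMX registers this gives eight 8-bit or four 16-bit lanes; for the 128-bit SSE registers it gives sixteen 8-bit, eight 16-bit, or four 32-bit lanes, and two lanes for 64-bit double-precision values.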
