System Design Flow and Fixed-point Arithmetic - Digital Design of Signal Processing Systems: A Practical Approach

3.4 Floating-point Format

Floating-point representation works well for numbers with a large dynamic range. The IEEE 754 standard [9] defines two representations, distinguished by the number of bits: 32-bit single precision and 64-bit double precision. This standard is used almost exclusively across computing platforms and hardware designs that support floating-point arithmetic. In this standard a normalized floating-point number x is stored in three parts: the sign s, the excess exponent e, and the significand or mantissa m. The value of the number in terms of these parts is:

$$ x = (-1)^{s} \times 1.m \times 2^{e-b} \qquad (3.6) $$

This is indeed a sign-magnitude representation: s represents the sign of the number and m gives the normalized magnitude with a 1 at the MSB position. This implied 1 is not stored with the number. For normalized values, the significand 1.m represents a value greater than or equal to 1.0 and less than 2.0. The IEEE format stores the exponent e as a biased number, that is, a positive number from which a constant bias b is subtracted to get the actual positive or negative exponent.

Figure 3.2 shows this representation for a single-precision floating-point number. Such a number is represented in 32 bits, where 1 bit is kept for the sign s, 8 bits for the exponent e and 23 bits for the mantissa m. For a 64-bit double-precision floating-point number, 1 bit is kept for the sign, 11 bits for the exponent and 52 bits for the mantissa. The values of the bias b are 127 and 1023, respectively, for the single- and double-precision floating-point formats.
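The field layout and equation (3.6) can be sketched in a few lines of Python. This is a minimal illustration, not a full IEEE 754 decoder: the names `FORMATS`, `split_fields` and `normalized_value` are hypothetical, and the check on all-zero and all-one exponent fields is an added assumption based on the standard reserving those patterns for zero, subnormal numbers, infinities and NaNs, which the text above does not cover.

```python
import struct

# Field widths and biases as stated in the text:
# format name -> (exponent bits, mantissa bits, bias b)
FORMATS = {
    "single": (8, 23, 127),
    "double": (11, 52, 1023),
}

def split_fields(word, fmt="single"):
    """Split an integer bit pattern into its (s, e, m) fields."""
    exp_bits, man_bits, _ = FORMATS[fmt]
    s = (word >> (exp_bits + man_bits)) & 0x1
    e = (word >> man_bits) & ((1 << exp_bits) - 1)
    m = word & ((1 << man_bits) - 1)
    return s, e, m

def normalized_value(s, e, m, fmt="single"):
    """Evaluate (-1)^s * 1.m * 2^(e - b) for a normalized number only."""
    exp_bits, man_bits, bias = FORMATS[fmt]
    if e == 0 or e == (1 << exp_bits) - 1:
        # Assumption: these exponent patterns are reserved and are not
        # normalized numbers, so equation (3.6) does not apply to them.
        raise ValueError("reserved exponent field, not a normalized number")
    significand = 1.0 + m / (1 << man_bits)  # restore the implied leading 1
    return (-1) ** s * significand * 2.0 ** (e - bias)

# Cross-check against the platform's own single-precision encoding of 6.5.
word = struct.unpack(">I", struct.pack(">f", 6.5))[0]
print(split_fields(word), normalized_value(*split_fields(word)))
```

Keeping the widths and bias in one table lets the same two functions serve both formats; passing `fmt="double"` with a 64-bit pattern follows the same path.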

Example: Find the value of the following 32-bit binary string representing a single-precision IEEE floating-point number: 0_10000010_11010000_00000000_0000000. The value is calculated by parsing the number into its fields, namely the sign bit, exponent and mantissa, and then computing the value of each field to get the final value, as shown in Table 3.4.
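The parsing steps of this example can be followed with a short Python sketch. The variable names are illustrative only; the bit string and the bias of 127 come from the text. With these fields the exponent is 130 - 127 = 3 and the significand is 1.1101 in binary, so the value works out to 14.5.

```python
# Bit string from the example, with the underscore separators removed.
bits = "0_10000010_11010000_00000000_0000000".replace("_", "")
assert len(bits) == 32

# Parse into the three fields: 1 sign bit, 8 exponent bits, 23 mantissa bits.
sign_field = bits[0]
exponent_field = bits[1:9]
mantissa_field = bits[9:]

s = int(sign_field, 2)
e = int(exponent_field, 2)                  # biased exponent
fraction = int(mantissa_field, 2) / 2**23   # the stored bits .m as a fraction
value = (-1) ** s * (1 + fraction) * 2.0 ** (e - 127)

print("sign     :", sign_field, "->", "-" if s else "+")
print("exponent :", exponent_field, "->", e, "- 127 =", e - 127)
print("mantissa :", mantissa_field, "-> 1 +", fraction)
print("value    :", value)
```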

Example: This example represents -12.25 in single-precision IEEE floating-point format. The number -12.25 in sign-magnitude binary is -00001100.01. Moving the binary point to bring the number into the right format gives $-1100.01 \times 2^{0} = -1.10001 \times 2^{3}$. Thus the normalized number is $-1.10001 \times 2^{3}$.

Sign bit (s) = 1

Mantissa field (m) = 10001000 00000000 0000000

Exponent field (e) = 3 + 127 = 130 = 82h = 1000 0010

So the complete 32-bit floating-point number in binary representation is:

1 10000010 10001000 00000000 0000000
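The hand encoding above can be checked with a small Python sketch. The helper `encode_single` is hypothetical and deliberately limited: it assumes a nonzero input whose normalized form fits exactly in single precision, truncates any extra mantissa bits, and does no range checking. The final lines compare its result with the bit pattern produced by Python's `struct` module.

```python
import struct

def encode_single(x):
    """Hand-encode a nonzero value as 'sign exponent mantissa' bit fields."""
    if x == 0:
        raise ValueError("zero has no normalized representation")
    s = 1 if x < 0 else 0
    mag = abs(x)
    exponent = 0
    # Shift the binary point until the magnitude lies in [1.0, 2.0).
    while mag >= 2.0:
        mag /= 2.0
        exponent += 1
    while mag < 1.0:
        mag *= 2.0
        exponent -= 1
    e = exponent + 127               # add the single-precision bias
    m = int((mag - 1.0) * 2**23)     # drop the implied 1; truncates extra bits
    return f"{s:01b} {e:08b} {m:023b}"

fields = encode_single(-12.25)
print(fields)

# Cross-check with the machine encoding of the same value.
word = struct.unpack(">I", struct.pack(">f", -12.25))[0]
print(f"{word:032b}", hex(word))
```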

Figure 3.2 IEEE format for single-precision 32-bit floating-point number: a 1-bit sign s (0 denotes +, 1 denotes -), an 8-bit exponent e (true exponent = e - 127) and a 23-bit mantissa m
