Digital Signal Processing Reference
3.4 Floating-point Format
Floating-point representation works well for numbers with a large dynamic range. Based on the
number of bits, the IEEE 754 standard [9] defines two representations: 32-bit single-precision and
64-bit double-precision. This standard is used almost exclusively across computing platforms
and hardware designs that support floating-point arithmetic. In this standard a normalized
floating-point number x is stored in three parts: the sign s, the excess exponent e, and the
significand or mantissa m. The value of the number in terms of these parts is:
$$ x = (-1)^{s} \times 1.m \times 2^{e-b} \qquad (3.6) $$
This is a sign-magnitude representation: s represents the sign of the number and m gives the
normalized magnitude with a 1 at the MSB position. This implied 1 is not stored with the number.
For normalized values, the significand 1.m is at least 1.0 and less than 2.0. The IEEE
format stores the exponent e as a biased number, that is, a positive number from which a constant
bias b is subtracted to get the actual positive or negative exponent.
Figure 3.2 shows this representation for a single-precision floating-point number. Such a number is
represented in 32 bits, where 1 bit is kept for the sign s, 8 bits for the exponent e and 23 bits for the
mantissa m. For a 64-bit double-precision floating-point number, 1 bit is kept for the sign, 11 bits for
the exponent and 52 bits for the mantissa. The values of the bias b are 127 and 1023, respectively, for
the single- and double-precision floating-point formats.
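The field widths and bias values above, together with equation (3.6), can be sketched in a few lines of Python. This is a minimal illustration rather than a full IEEE 754 decoder: the function name `decode_normalized` and the `SINGLE` and `DOUBLE` tuples are names chosen here for illustration, and the sketch assumes the input is a normalized number, so it rejects the all-zeros and all-ones exponent fields that the standard reserves for other cases.

```python
# (exponent bits, mantissa bits, bias) as given in the text above
SINGLE = (8, 23, 127)
DOUBLE = (11, 52, 1023)


def decode_normalized(bits, exp_bits, frac_bits, bias):
    """Evaluate x = (-1)^s * 1.m * 2^(e - b) for an integer bit pattern."""
    s = bits >> (exp_bits + frac_bits)               # top bit: sign
    e = (bits >> frac_bits) & ((1 << exp_bits) - 1)  # biased exponent field
    m = bits & ((1 << frac_bits) - 1)                # stored mantissa bits
    if e == 0 or e == (1 << exp_bits) - 1:
        # Assumption of this sketch: only normalized numbers are handled.
        raise ValueError("exponent field is reserved; not a normalized number")
    significand = 1 + m / (1 << frac_bits)           # restore the implied 1
    return (-1) ** s * significand * 2.0 ** (e - bias)


print(decode_normalized(0x3FC00000, *SINGLE))          # 1.5
print(decode_normalized(0x3FF8000000000000, *DOUBLE))  # 1.5
```

The same value 1.5 has different bit patterns in the two formats because both the bias and the field widths differ.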
Example: Find the value of the following 32-bit binary string representing a single-precision IEEE
floating-point number: 0_10000010_11010000_00000000_0000000. The value is calculated by
parsing the number into its fields, namely sign bit, exponent and mantissa, and then computing
the value of each field to get the final value, as shown in Table 3.4.
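The parsing steps for this bit string can also be followed in code. The sketch below splits the string into the three fields, applies the single-precision bias of 127, and cross-checks the result against Python's standard `struct` module; the variable names are illustrative only.

```python
import struct

bits = "0_10000010_11010000_00000000_0000000"
raw = bits.replace("_", "")   # 32 characters: 1 + 8 + 23

s = int(raw[0], 2)            # sign bit
e = int(raw[1:9], 2)          # biased exponent field
m = int(raw[9:], 2)           # 23 stored mantissa bits

# Restore the implied leading 1 and remove the bias.
value = (-1) ** s * (1 + m / 2**23) * 2.0 ** (e - 127)

# Cross-check: reinterpret the same 32 bits as a big-endian float.
reference = struct.unpack(">f", int(raw, 2).to_bytes(4, "big"))[0]

print(s, e - 127, 1 + m / 2**23)  # 0 3 1.8125
print(value, reference)           # 14.5 14.5
```

The fields work out to s = 0, a true exponent of 130 - 127 = 3, and a significand of 1.1101 in binary (1.8125), so the value is 1.8125 x 8 = 14.5.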
Example: This example represents -12.25 in single-precision IEEE floating-point format. The
number -12.25 in sign-magnitude binary is -00001100.01. Now moving the binary point to
bring it into the right format: $-1100.01 \times 2^{0} = -1.10001 \times 2^{3}$. Thus the normalized
number is $-1.10001 \times 2^{3}$.
Sign bit (s) = 1
Exponent field (e) = 3 + 127 = 130 = 82h = 1000 0010
Mantissa field (m) = 10001000 00000000 0000000
So the complete 32-bit floating-point number in binary representation is:
1 10000010 10001000 00000000 0000000
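The encoding steps of this example can be reproduced with a short sketch. The helper `encode_single` is a hypothetical name introduced here, not a library function. It assumes a nonzero value that is exactly representable as a normalized single-precision number, so it truncates the mantissa instead of implementing IEEE rounding, and it does not handle zero, infinities, NaN, or denormalized values.

```python
import math
import struct


def encode_single(x):
    """Return the (sign, exponent, mantissa) fields of x as bit strings."""
    sign = 1 if x < 0 else 0
    frac, exp = math.frexp(abs(x))   # abs(x) = frac * 2**exp, 0.5 <= frac < 1
    significand = frac * 2           # rescale so that 1.0 <= significand < 2.0
    true_exp = exp - 1
    biased = true_exp + 127          # single-precision bias
    # Drop the implied leading 1 and keep 23 fraction bits (truncation only).
    field = int((significand - 1.0) * 2**23)
    return f"{sign:01b}", f"{biased:08b}", f"{field:023b}"


s, e, m = encode_single(-12.25)
print(s, e, m)  # 1 10000010 10001000000000000000000

# Cross-check against the bit pattern produced by the struct module.
packed = int.from_bytes(struct.pack(">f", -12.25), "big")
print(f"{packed:032b}" == s + e + m)  # True
```

`math.frexp` returns a fraction in the range 0.5 to 1, so the sketch doubles it and lowers the exponent by one to obtain the 1.m form used in the text.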
Figure 3.2 IEEE format for single-precision 32-bit floating-point number. Fields, from MSB to LSB: s (1 bit, sign: 0 denotes +, 1 denotes -), e (8 bits, true exponent = e - 127), m (23 bits, mantissa).