Digital Signal Processing Reference
Figure 3.11 Rounding followed by truncation (0111_0111 in Q4.4 is 7.4375; rounding adds 1 at the bit to the right of the truncation point, giving 0111_1001; truncation to Q4.2 then gives 0111_10 = 7.5)
3.5.5.1 Simple Truncation
In multiplication of two Q-format numbers, the number of bits in the product increases. Precision is sacrificed by dropping some of the least significant bits of the product: Qn1.m1 is truncated to Qn1.m2, where m2 < m1.
Example:
0111_0111 in Q4.4 is 7.4375
Truncated to Q4.2 gives 0111_01 = 7.25
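The truncation in the example above can be reproduced with a short Python sketch. The helper names `truncate` and `q_to_float` are illustrative assumptions, not part of the original text, and the raw Q-format word is modelled as a plain non-negative integer.

```python
def q_to_float(raw: int, frac_bits: int) -> float:
    """Interpret a raw integer as a Q-format value with frac_bits fractional bits."""
    return raw / (1 << frac_bits)


def truncate(raw: int, m1: int, m2: int) -> int:
    """Truncate Qn.m1 to Qn.m2 by dropping the (m1 - m2) least significant bits."""
    if m2 > m1:
        raise ValueError("m2 must not exceed m1")
    return raw >> (m1 - m2)


x = 0b0111_0111            # 0111_0111 in Q4.4
y = truncate(x, 4, 2)      # 0111_01 in Q4.2

print(q_to_float(x, 4))    # 7.4375
print(q_to_float(y, 2))    # 7.25
```

The right shift simply discards the low bits, so the result is never larger than the original value; this one-sided error is the source of the bias discussed in the next subsection.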
3.5.5.2 Rounding Followed by Truncation
Direct truncation of numbers biases the results, so in many applications it is preferred to round before trimming the number to the desired size. For this, 1 is added to the bit immediately to the right of the truncation point. The resulting number is then truncated to the desired number of bits. This is shown in the example in Figure 3.11. Rounding and then truncating gives a better approximation; in the example, simple truncation to Q4.2 results in a number with value 7.25, whereas rounding before truncation gives 7.5, which is closer to the actual value 7.4375.
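As a minimal sketch of the rounding step described above, the following Python adds 1 at the bit just right of the truncation point and then shifts. The function names are hypothetical, and the raw word is again modelled as a non-negative integer.

```python
def q_to_float(raw: int, frac_bits: int) -> float:
    """Interpret a raw integer as a Q-format value with frac_bits fractional bits."""
    return raw / (1 << frac_bits)


def round_then_truncate(raw: int, m1: int, m2: int) -> int:
    """Round Qn.m1 to Qn.m2: add 1 at the first dropped bit, then truncate."""
    shift = m1 - m2
    if shift < 0:
        raise ValueError("m2 must not exceed m1")
    if shift == 0:
        return raw                      # nothing to drop
    rounded = raw + (1 << (shift - 1))  # 0111_0111 + 0000_0010 = 0111_1001
    return rounded >> shift             # keep 0111_10


x = 0b0111_0111                         # 7.4375 in Q4.4
y = round_then_truncate(x, 4, 2)

print(bin(y))                           # 0b11110
print(q_to_float(y, 2))                 # 7.5
```

Note that the added 1 can carry into the integer bits (for example when the fractional bits are all 1), so a real design would also check the rounded result against the available word length.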
3.5.6 Overflow and Saturation
Overflow is a serious consequence of fixed-point arithmetic. Overflow occurs if two positive or two negative numbers are added and the sum requires more than the available number of bits. For example, in a 3-bit two's complement representation, if 1 is added to 3 (= 3'b011), the sum is 4 (= 4'b0100). The number 4 thus requires four bits and cannot be represented as a 3-bit two's complement signed number, because 3'b100 represents -4. This causes an error equal to the full dynamic range of the number and so adversely affects subsequent computation that uses this number. Figure 3.12 shows the case of an overflow for a 3-bit number, adding an error equal to the dynamic range of the number. It is therefore imperative to check the overflow condition after performing arithmetic operations that can cause a possible overflow. If an overflow results, the designer should set an overflow flag. In many circumstances, it is better to curtail the result to the maximum positive or minimum negative value that the defined word length can represent. In the above example the value should be limited to 3'b011.
Thus, the following computation is in 3-bit precision with an overflow flag set to indicate this abnormal result:
3 + 1 = 3 and overflow flag = 1
Similarly, performing subtraction with an overflow flag set to 1 is:
-4 - 1 = -4 and overflow flag = 1
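The wraparound and saturation behaviour described in this section can be illustrated with a small Python sketch. The names `wrap` and `saturating_add` are assumptions made for illustration; the word length defaults to the 3 bits used in the text, and subtraction is expressed as addition of a negative operand.

```python
def wrap(value: int, bits: int) -> int:
    """Keep only the low `bits` bits and reinterpret them as two's complement."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):       # sign bit set, so the value is negative
        return value - (1 << bits)
    return value


def saturating_add(a: int, b: int, bits: int = 3) -> tuple[int, int]:
    """Add two signed values; clamp to the word range and report an overflow flag."""
    lo = -(1 << (bits - 1))             # minimum negative value, -4 for 3 bits
    hi = (1 << (bits - 1)) - 1          # maximum positive value, 3 for 3 bits
    total = a + b
    if total > hi:
        return hi, 1
    if total < lo:
        return lo, 1
    return total, 0


print(wrap(3 + 1, 3))                   # -4: unchecked 3-bit addition wraps around
print(saturating_add(3, 1))             # (3, 1)
print(saturating_add(-4, -1))           # (-4, 1)
print(saturating_add(1, 1))             # (2, 0)
```

Without saturation the result of 3 + 1 is off by 8, the full dynamic range of a 3-bit number; with saturation the error is limited to 1 and the flag records that clamping occurred.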