narrower than the critical bands of the ear. Figure 4.27 showed the critical condition where the masking tone is at
the top edge of the sub-band. The use of an excessive number of sub-bands will, however, raise complexity and the
coding delay. The use of 32 equal sub-bands in MPEG Layers I and II is a compromise.
Efficient polyphase bandsplitting filters can only operate with equal-width sub-bands and the result, in an
octave-based hearing model, is that sub-bands are too wide at low frequencies and too narrow at high frequencies.
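The mismatch between equal-width sub-bands and an octave-based model can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a 48 kHz sampling rate (the rate used later in the text) and a 20 Hz lower limit of hearing (an assumption made only for illustration). The function names are hypothetical.

```python
import math

def subband_edges(sample_rate_hz, n_bands=32):
    """Return (low, high) edges in Hz of equal-width sub-bands up to half the sampling rate."""
    width = sample_rate_hz / 2 / n_bands
    return [(k * width, (k + 1) * width) for k in range(n_bands)]

def octaves(low_hz, high_hz):
    """Width of a frequency interval expressed in octaves."""
    return math.log2(high_hz / low_hz)

edges = subband_edges(48000)
# Lowest band: 0 to 750 Hz. Measured from an assumed 20 Hz lower limit of
# hearing, it spans more than five octaves.
low_band_octaves = octaves(20.0, edges[0][1])
# Highest band: 23250 to 24000 Hz, a small fraction of one octave.
high_band_octaves = octaves(*edges[-1])
```

Every sub-band is 750 Hz wide, yet the lowest covers several octaves while the highest covers only a few hundredths of one.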
To offset the lack of accuracy in the sub-band filter, a parallel fast Fourier transform is used to drive the masking
model. The standard suggests masking models, but compliant bitstreams can result from other models. In Layer I a
512-point FFT is used. The output of the FFT is employed to determine the masking threshold, which is the sum of
all masking sources. Masking sources include at least the threshold of hearing, which may locally be raised by the
frequency content of the input audio. The degree to which the threshold is raised depends on whether the input
audio is sinusoidal or atonal (broadband, or noise-like). In the case of a sine wave, the magnitude and phase of the
FFT at each frequency will be similar from one window to the next, whereas if the sound is atonal the magnitude
and phase information will be chaotic.
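One way to turn this observation into a number is to predict the FFT value of the current window from the two preceding windows and measure how far the prediction misses. The following is a minimal sketch of that idea, not the model prescribed by the standard. The linear extrapolation of magnitude and phase, the hop size between windows and all names are assumptions for illustration. Only one DFT bin is computed, so no FFT library is needed.

```python
import cmath
import math

FFT_SIZE = 512  # Layer I analysis length given in the text
HOP = 100       # hypothetical spacing in samples between successive windows

def dft_bin(frame, k):
    """Direct evaluation of a single DFT bin k of one frame."""
    n = len(frame)
    return sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))

def unpredictability(signal, k, start=0):
    """Return a value near 0 for a steady tone in bin k, larger for noise-like input.

    Magnitude and phase of the third window are extrapolated linearly
    from the first two windows and compared with the measured value.
    """
    x0, x1, x2 = (
        dft_bin(signal[start + t * HOP : start + t * HOP + FFT_SIZE], k)
        for t in range(3)
    )
    predicted_mag = 2 * abs(x1) - abs(x0)
    predicted_phase = 2 * cmath.phase(x1) - cmath.phase(x0)
    predicted = cmath.rect(predicted_mag, predicted_phase)
    denominator = abs(x2) + abs(predicted)
    return abs(x2 - predicted) / denominator if denominator else 0.0
```

For a sine wave centred on bin k, the magnitude is constant and the phase advances by the same amount each hop, so the prediction is essentially exact. For noise, the measure is substantially larger.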
The masking threshold is effectively a graph of just-noticeable noise as a function of frequency. Figure 4.31(a)
shows an example. The masking threshold is calculated by convolving the FFT spectrum with the cochlea
spreading function, with corrections for tonality. The level of the masking threshold cannot fall below the absolute
masking threshold, which is the threshold of hearing. The masking threshold is then superimposed on the actual
frequencies of each sub-band so that the allowable level of distortion in each can be established. This is shown in
Figure 4.31(b).
Figure 4.31: A continuous curve (a) of the just-noticeable noise level is calculated by the masking model. The
levels of noise in each sub-band (b) must be set so as not to exceed the level of the curve.
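The two steps of Figure 4.31 can be sketched as follows, assuming everything is expressed as linear power per FFT bin. The spreading weights used in the example are invented for illustration and are not the cochlea spreading function. The tonality correction is omitted, and the function names are hypothetical.

```python
def masking_threshold(power, spread, absolute):
    """Convolve the power spectrum with a spreading function, then apply the hearing floor.

    power    : linear power per FFT bin
    spread   : odd-length list of weights; the centre entry applies at the
               masker's own bin, later entries apply above it in frequency
    absolute : threshold of hearing per bin (same length as power)
    """
    half = len(spread) // 2
    spread_sum = [0.0] * len(power)
    for k, p in enumerate(power):            # each bin acts as a masking source
        for j, w in enumerate(spread):
            i = k + j - half                 # bin receiving the masking
            if 0 <= i < len(power):
                spread_sum[i] += p * w
    # The threshold can never fall below the threshold of hearing.
    return [max(s, a) for s, a in zip(spread_sum, absolute)]

def allowed_noise_per_subband(threshold, bins_per_band):
    """Noise in a sub-band must not exceed the curve anywhere inside that band."""
    return [
        min(threshold[s : s + bins_per_band])
        for s in range(0, len(threshold), bins_per_band)
    ]

# A single masker in bin 2, with masking spreading further upward than downward.
curve = masking_threshold(
    power=[0, 0, 1, 0, 0, 0, 0, 0],
    spread=[0.05, 0.5, 0.2],
    absolute=[0.01] * 8,
)
per_band = allowed_noise_per_subband(curve, bins_per_band=2)
```

Taking the minimum of the curve within each sub-band is the conservative choice: it guarantees that the noise level set for the band does not exceed the curve at any frequency in it.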
Constant-size input blocks are used, containing 384 samples. At 48 kHz, 384 samples correspond to a period of 8
ms. After the sub-band filter each band contains 12 samples per block. The block size is too long to avoid the
pre-masking phenomenon of Figure 4.27. Consequently, the masking model must ensure that heavy requantizing is not
used in a block which contains a large transient following a period of quiet. This can be done by comparing
parameters of the current block with those of the previous block, as a significant difference will indicate transient
activity.
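The block arithmetic and a simple block-to-block comparison can be sketched as follows. The text does not say which parameters are compared, so the use of block energy and the ratio threshold of 8 are assumptions chosen only for illustration.

```python
BLOCK_SAMPLES = 384
N_SUBBANDS = 32

def block_duration_ms(sample_rate_hz):
    """Duration of one input block in milliseconds."""
    return 1000.0 * BLOCK_SAMPLES / sample_rate_hz

def samples_per_subband():
    """Each sub-band is decimated by the number of bands."""
    return BLOCK_SAMPLES // N_SUBBANDS

def block_energy(block):
    return sum(x * x for x in block)

def is_transient(previous_block, current_block, ratio_threshold=8.0, floor=1e-12):
    """Flag a block whose energy jumps sharply relative to the previous block.

    ratio_threshold is a hypothetical tuning value; floor avoids a
    zero reference when the previous block is digital silence.
    """
    reference = max(block_energy(previous_block), floor)
    return block_energy(current_block) > ratio_threshold * reference
```

An encoder could use such a flag to hold back coarse requantizing in the flagged block, since quantizing noise spread across the 8 ms block would precede the transient and might not be pre-masked.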
The samples in each sub-band block or bin are companded according to the peak value in the bin. A six-bit scale
factor is used for each sub-band, which applies to all 12 samples. The gain step is 2 dB and so, with a six-bit code,
over 120 dB of dynamic range is available.
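The scale factor arithmetic can be sketched as follows. This is a hedged illustration and not the scale factor table of the standard: it assumes that code 0 represents full scale (1.0), that each successive code is exactly 2 dB lower, and that all 64 codes are usable. The function names are hypothetical.

```python
import math

STEP_DB = 2.0
N_CODES = 64  # six-bit code

def scale_factor_value(index):
    """Linear value of a scale factor code under the assumed 2 dB ladder."""
    return 10.0 ** (-index * STEP_DB / 20.0)

def scale_factor_index(peak):
    """Pick the smallest scale factor that is still at least as large as the peak."""
    if peak <= 0.0:
        return N_CODES - 1
    index = math.floor(-20.0 * math.log10(peak) / STEP_DB)
    return max(0, min(N_CODES - 1, index))

def compand(samples):
    """Normalize the 12 samples of a bin by the scale factor chosen from their peak."""
    index = scale_factor_index(max(abs(s) for s in samples))
    scale = scale_factor_value(index)
    return index, [s / scale for s in samples]

# Span between the largest and smallest code: 63 steps of 2 dB = 126 dB.
dynamic_range_db = (N_CODES - 1) * STEP_DB
```

Because the chosen scale factor is never smaller than the peak, the normalized samples stay within the range -1 to +1 and can be passed to the requantizer.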
A fixed output bit rate is employed, and as there is no buffering the size of the coded output block will be fixed. The
wordlengths in each bin will have to be such that the sum of the bits from all the sub-bands equals the size of the
coded block. Thus some sub-bands can have long wordlength coding if others have short wordlength coding. The
process of determining the requantization step size, and hence the wordlength in each sub-band, is known as bit
allocation. In Layer I all sub-bands are treated in the same way and fourteen different requantization classes are
used. Each one has an odd number of quantizing intervals so that all codes are referenced to a precise zero level.
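A quantizer with an odd number of intervals places one interval exactly on zero. The following minimal sketch assumes that the fourteen classes correspond to wordlengths of 2 to 15 bits, each offering 2^n - 1 levels. That mapping is an assumption for illustration, as are the function names.

```python
WORDLENGTHS = range(2, 16)  # assumed: fourteen classes, 2 to 15 bits

def levels(bits):
    """An n-bit word is used for 2**n - 1 levels, which is always odd."""
    return 2 ** bits - 1

def quantize(x, bits):
    """Map a normalized sample in [-1, 1] to a signed integer code centred on zero."""
    half = (levels(bits) - 1) // 2
    code = round(x * half)
    return max(-half, min(half, code))

def dequantize(code, bits):
    """Return the reconstruction value for a code."""
    half = (levels(bits) - 1) // 2
    return code / half
```

With an odd level count the codes run symmetrically from -half to +half, so a silent input always decodes to exactly zero regardless of wordlength.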
Where masking takes place, the signal is quantized more coarsely until the distortion level is raised to just below
the masking level. The coarse quantization requires shorter wordlengths and allows a coding gain. The bit
allocation may be iterative as adjustments are made to obtain an equal NMR across all sub-bands. If the allowable
data rate is adequate, a positive NMR will result and the decoded quality will be optimal. However, at lower bit rates
and in the absence of buffering a temporary increase in bit rate is not possible. The coding distortion cannot be
masked and the best the encoder can do is to make the (negative) NMR equal across the spectrum so that artifacts
are not emphasized unduly in any one sub-band. It is possible that in some sub-bands there will be no data at all,
either because such frequencies were absent in the program material or because the encoder has discarded them
to meet a low bit rate.
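The iterative allocation can be sketched as a greedy loop that repeatedly gives one more bit per sample to the sub-band that is currently worst off. This is a simplified illustration, not the procedure of the standard. It assumes about 6.02 dB of signal-to-noise ratio per bit, wordlengths of 2 to 15 bits, and a cost of six scale factor bits when a sub-band first receives data. The margin computed here follows the sign convention of the text, where a positive value means the distortion lies below the masking level. The signal-to-mask inputs and all names are hypothetical.

```python
SAMPLES_PER_BAND = 12
SCALE_FACTOR_BITS = 6
MAX_BITS = 15
DB_PER_BIT = 6.02  # approximate SNR gained per bit of wordlength

def allocate_bits(signal_to_mask_db, budget_bits):
    """Greedy bit allocation across sub-bands within a fixed bit budget.

    signal_to_mask_db : per sub-band, how far the signal is above the mask (dB)
    budget_bits       : bits available for samples and scale factors
    Returns the wordlength per sub-band; 0 means no data is sent.
    """
    bits = [0] * len(signal_to_mask_db)

    def margin(i):
        # Positive when quantizing noise is below the masking level.
        return DB_PER_BIT * bits[i] - signal_to_mask_db[i]

    def step_cost(i):
        if bits[i] == 0:  # first step jumps to 2 bits and adds a scale factor
            return 2 * SAMPLES_PER_BAND + SCALE_FACTOR_BITS
        return SAMPLES_PER_BAND

    remaining = budget_bits
    while True:
        candidates = [
            i for i in range(len(bits))
            if bits[i] < MAX_BITS and step_cost(i) <= remaining
        ]
        if not candidates:
            break
        worst = min(candidates, key=margin)
        remaining -= step_cost(worst)
        bits[worst] = 2 if bits[worst] == 0 else bits[worst] + 1
    return bits
```

For example, `allocate_bits([20, 5, -10], 100)` returns `[5, 2, 0]`: the sub-band furthest above its mask receives the longest wordlength, and the third sub-band, whose signal is already below the mask, receives no data at all.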
The samples of differing wordlengths in each bin are then assembled into the output coded block. Unlike a PCM
block, which contains samples of fixed wordlength, a coded block contains many different wordlengths, which
may vary from one sub-band to the next. In order to deserialize the block into samples of various wordlengths and
demultiplex the samples into the appropriate frequency bins, the decoder has to be told what bit allocations were