17.3.3 Layer III Coding—mp3
Layer III coding, which has become widely popular under the name mp3, is considerably more
complex than the Layer I and Layer II coding schemes. One of the problems with the Layer I
and II coding schemes is that with 32-band decomposition, the bandwidth of the subbands at
lower frequencies is significantly larger than the critical bands. This makes it difficult to make
an accurate judgement of the mask-to-signal ratio. If we get a high-amplitude tone within a
subband and the subband is narrow enough, we can assume that it masks the other tones in
the band. However, if the bandwidth of the subband is significantly larger than the critical
bandwidth at that frequency, it becomes more difficult to determine whether other tones in the
subband will be masked.
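To get a feel for the mismatch described above, the following sketch compares the width of a uniform subband with an estimate of the critical bandwidth at the center of that subband. It rests on two assumptions that are not part of the text: a sampling rate of 44.1 kHz, and the Zwicker-Terhardt approximation for the critical bandwidth. The names SUBBAND_WIDTH_HZ, critical_bandwidth_hz, and width_ratio are illustrative only.

```python
SAMPLE_RATE_HZ = 44100.0  # assumption: the text does not name a sampling rate
NUM_SUBBANDS = 32         # from the text: 32-band decomposition

# A uniform filterbank gives every subband the same width.
SUBBAND_WIDTH_HZ = (SAMPLE_RATE_HZ / 2.0) / NUM_SUBBANDS


def critical_bandwidth_hz(freq_hz):
    """Zwicker-Terhardt approximation of the critical bandwidth (assumed model)."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (freq_hz / 1000.0) ** 2) ** 0.69


def width_ratio(band_index):
    """Subband width divided by the critical bandwidth at the subband center."""
    center_hz = (band_index + 0.5) * SUBBAND_WIDTH_HZ
    return SUBBAND_WIDTH_HZ / critical_bandwidth_hz(center_hz)


if __name__ == "__main__":
    for band in (0, 1, 2, 8, 16, 31):
        print(f"subband {band:2d}: width / critical bandwidth = {width_ratio(band):.2f}")
```

Under these assumptions the ratio is well above 1 for the lowest subbands, where a single subband spans several critical bands, and drops below 1 for the highest subbands.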
A simple way to increase the spectral resolution would be to decompose the signal directly
into a higher number of bands. However, one of the requirements on the Layer III algorithm
is that it be backward compatible with Layer I and Layer II coders. To satisfy this backward
compatibility requirement, the spectral decomposition in the Layer III algorithm is performed
in two stages. First the 32-band subband decomposition used in Layer I and Layer II is
employed. The output of each subband is then transformed using a modified discrete cosine
transform (MDCT) with a 50% overlap. The Layer III algorithm specifies two sizes for the
MDCT, 6 or 18. This means that the output of each subband can be decomposed into 18
frequency coefficients or 6 frequency coefficients.
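The second stage can be sketched as follows. An MDCT that produces N coefficients operates on a window of 2N samples and advances by N samples, which is where the 50% overlap comes from. The sketch below is a direct, unoptimized MDCT and inverse MDCT with a sine window; the window choice and the function names (sine_window, mdct, imdct, overlap_add_roundtrip) are assumptions made for illustration, and the code is not meant to reproduce the Layer III bitstream exactly.

```python
import math


def sine_window(n_coeffs):
    """Sine window of length 2 * n_coeffs (assumed; satisfies w[n]^2 + w[n+N]^2 = 1)."""
    length = 2 * n_coeffs
    return [math.sin(math.pi * (n + 0.5) / length) for n in range(length)]


def mdct(block, n_coeffs):
    """Transform 2 * n_coeffs windowed samples into n_coeffs coefficients."""
    w = sine_window(n_coeffs)
    return [
        sum(
            w[n] * block[n]
            * math.cos(math.pi / n_coeffs * (n + 0.5 + n_coeffs / 2.0) * (k + 0.5))
            for n in range(2 * n_coeffs)
        )
        for k in range(n_coeffs)
    ]


def imdct(coeffs):
    """Map n_coeffs coefficients back to 2 * n_coeffs windowed, aliased samples."""
    n_coeffs = len(coeffs)
    w = sine_window(n_coeffs)
    return [
        (2.0 / n_coeffs) * w[n]
        * sum(
            coeffs[k]
            * math.cos(math.pi / n_coeffs * (n + 0.5 + n_coeffs / 2.0) * (k + 0.5))
            for k in range(n_coeffs)
        )
        for n in range(2 * n_coeffs)
    ]


def overlap_add_roundtrip(samples, n_coeffs):
    """Analyze with 50% overlap, then overlap-add the inverse transforms.

    len(samples) must be a multiple of n_coeffs.
    """
    padded = [0.0] * n_coeffs + list(samples) + [0.0] * n_coeffs
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * n_coeffs + 1, n_coeffs):
        coeffs = mdct(padded[start:start + 2 * n_coeffs], n_coeffs)
        for n, value in enumerate(imdct(coeffs)):
            out[start + n] += value  # aliasing cancels between adjacent blocks
    return out[n_coeffs:-n_coeffs]


if __name__ == "__main__":
    subband_output = [math.sin(0.3 * i) for i in range(36)]
    for size in (18, 6):
        rebuilt = overlap_add_roundtrip(subband_output, size)
        err = max(abs(a - b) for a, b in zip(subband_output, rebuilt))
        print(f"MDCT size {size:2d}: max reconstruction error {err:.2e}")
```

Each call to mdct returns 18 or 6 coefficients per block. Without quantization, the overlap-add of the inverse transforms recovers the subband samples, because the time-domain aliasing of adjacent blocks cancels.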
The reason for having two sizes for the MDCT is that when we transform a sequence into
the frequency domain, we lose time resolution even as we gain frequency resolution. The
larger the block size, the more we lose in terms of time resolution. The problem with this
is that any quantization noise introduced into the frequency coefficients gets spread over
the entire block of the transform. Backward temporal masking occurs for only a short
duration prior to the masking sound (approximately 20 msec). Therefore, when the block extends
further back in time than this, quantization noise preceding a sharp onset will be heard
as a pre-echo. Consider the signal shown in Figure 17.7. The sequence consists of
128 samples, the first 118 of which are 0, followed by a sharp increase in value. The 128-point
DCT of this sequence is shown in Figure 17.8. Notice that many of these coefficients are quite
large. If we were to send all these coefficients, we would have data expansion instead of data
compression. If we keep only the 10 largest coefficients, we obtain the reconstructed signal shown
in Figure 17.9. Notice that not only are the nonzero signal values not well represented, there
is also error in the samples prior to the change in value of the signal. If this were an audio
signal and the large values had occurred at the beginning of the sequence, the forward masking
effect would have reduced the perceptibility of the quantization error. In the situation shown in
Figure 17.9, backward masking will mask some of the quantization error. However, backward
masking occurs for only a short duration prior to the masking sound. Therefore, if the length
of the block in question is longer than the masking interval, the distortion will be evident to
the listener.
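The experiment described above can be reproduced in outline with a short script. The text does not give the sample values after the onset or the DCT normalization, so this sketch assumes a step to 1.0 for the last 10 samples and an orthonormal DCT-II; the names dct, idct, keep_largest, and pre_onset_error are illustrative.

```python
import math

N = 128       # block length, from the text
ONSET = 118   # the first 118 samples are zero, from the text


def dct(x):
    """Orthonormal DCT-II (assumed normalization), computed directly."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                               for i in range(n)))
    return out


def idct(c):
    """Inverse of the orthonormal DCT-II above."""
    n = len(c)
    out = []
    for i in range(n):
        total = math.sqrt(1.0 / n) * c[0]
        total += sum(math.sqrt(2.0 / n) * c[k] * math.cos(math.pi * (i + 0.5) * k / n)
                     for k in range(1, n))
        out.append(total)
    return out


def keep_largest(c, count):
    """Zero every coefficient except the `count` largest in magnitude."""
    ranked = sorted(range(len(c)), key=lambda k: abs(c[k]), reverse=True)
    keep = set(ranked[:count])
    return [c[k] if k in keep else 0.0 for k in range(len(c))]


# Assumed test signal: silence followed by a step to 1.0.
signal = [0.0] * ONSET + [1.0] * (N - ONSET)
coeffs = dct(signal)
approx = idct(keep_largest(coeffs, 10))

# Error in the samples before the onset: this is the pre-echo.
pre_onset_error = max(abs(v) for v in approx[:ONSET])

if __name__ == "__main__":
    print(f"largest error before the onset: {pre_onset_error:.3f}")
```

The original signal is exactly zero before sample 118, so any nonzero value of approx in that range is distortion that precedes the sound that caused it.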
If we get a sharp sound that is very limited in time (such as the sound of castanets), we
would like to keep the block size small enough that the block just contains this sharp sound.
Then, any quantization noise we incur will not spread beyond the interval in which the actual
sound occurred and will therefore be masked. The Layer III algorithm monitors the input
and where necessary substitutes three short transforms for one long transform. What actually
happens is that the subband output is multiplied by a window function of length 36 during