Audio compression - The MPEG


4.20 MPEG Layer III audio coding

Layer III is the most complex layer, and is only really necessary when the most severe data rate constraints must be met. In its application to music delivery over the Internet it is also known as MP3. It is a transform coder based on the ASPEC system, with certain modifications to give a degree of commonality with Layer II. The original ASPEC coder used a direct MDCT on the input samples. In Layer III this was modified to use a hybrid transform incorporating the existing polyphase 32-band QMF of Layers I and II and retaining the block size of 1152 samples.

In Layer III, the 32 sub-bands from the QMF are further processed by a critically sampled MDCT.
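
The phrase "critically sampled MDCT" can be made concrete with a small experiment. The Python sketch below is a minimal illustration, not the Layer III hybrid filter bank: it applies a direct textbook MDCT to a test signal rather than to QMF sub-band outputs, and the sine window, the 2/N inverse scaling and the function names (`sine_window`, `mdct`, `imdct`, `analyse_synthesise`) are assumptions made for the example. Each 36-sample block overlaps its neighbour by half and yields only 18 coefficients, one per new input sample, yet overlap-adding the inverse transforms rebuilds the input because the time-domain aliasing of adjacent blocks cancels.

```python
import math

def sine_window(length):
    # Symmetric, and w[n]^2 + w[n + length/2]^2 = 1, so aliasing can cancel (assumed window)
    return [math.sin(math.pi / length * (n + 0.5)) for n in range(length)]

def mdct(block):
    """2N windowed input samples -> N coefficients (critically sampled)."""
    n = len(block) // 2
    n0 = 0.5 + n / 2
    return [sum(block[i] * math.cos(math.pi / n * (i + n0) * (k + 0.5))
                for i in range(2 * n))
            for k in range(n)]

def imdct(coeffs):
    """N coefficients -> 2N time-aliased samples, scaled by 2/N (assumed scaling)."""
    n = len(coeffs)
    n0 = 0.5 + n / 2
    return [2.0 / n * sum(coeffs[k] * math.cos(math.pi / n * (i + n0) * (k + 0.5))
                          for k in range(n))
            for i in range(2 * n)]

def analyse_synthesise(signal, n=18):
    """Window, MDCT, inverse MDCT, window again, overlap-add with a hop of n."""
    win = sine_window(2 * n)
    padded = [0.0] * n + list(signal) + [0.0] * (2 * n)
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * n + 1, n):
        block = [padded[start + i] * win[i] for i in range(2 * n)]
        coeffs = mdct(block)                   # n coefficients for n new samples
        for i, y in enumerate(imdct(coeffs)):
            out[start + i] += y * win[i]       # aliasing cancels between neighbours
    return out[n:n + len(signal)]

if __name__ == "__main__":
    test = [math.sin(0.3 * i) + 0.5 * math.cos(1.7 * i) for i in range(72)]
    rebuilt = analyse_synthesise(test)
    print(max(abs(a - b) for a, b in zip(test, rebuilt)))  # rounding-error level
```

The direct O(N²) sums keep the arithmetic visible; a practical coder would use a fast algorithm.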

The windows overlap by two to one. Two window sizes are used to reduce pre-echo on transients. The long window works with 36 sub-band samples, which corresponds to 24 ms only when the sampling rate is 48 kHz, and resolves 18 different frequencies, making 576 (32 × 18) frequencies altogether. Coding products are spread over this period, which is acceptable in stationary material but not in the vicinity of transients. In this case the window length is reduced to 8 ms: twelve sub-band samples are resolved into six different frequencies, making a total of 192 (32 × 6) frequencies. This is the Heisenberg inequality: by increasing the time resolution by a factor of three, the frequency resolution has fallen by the same factor.
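
The numbers in this paragraph follow from the 32-band QMF by simple arithmetic. The sketch below (the helper name `block_geometry` is hypothetical) assumes, as the text does, that a window spanning a given number of sub-band samples yields half that many frequencies per sub-band, because successive windows overlap two to one. It also shows that the 24 ms figure changes at other sampling rates.

```python
SUBBANDS = 32  # polyphase QMF bands shared with Layers I and II

def block_geometry(subband_samples, fs_hz=48_000):
    """Return (window span in ms, frequencies per sub-band, total frequencies)."""
    pcm_samples = subband_samples * SUBBANDS   # input samples covered by one window
    span_ms = 1000 * pcm_samples / fs_hz
    per_band = subband_samples // 2            # critically sampled MDCT, 2:1 overlap
    return span_ms, per_band, per_band * SUBBANDS

if __name__ == "__main__":
    long_ms, _, long_total = block_geometry(36)    # 24.0 ms, 576 frequencies
    short_ms, _, short_total = block_geometry(12)  # 8.0 ms, 192 frequencies
    print(long_ms / short_ms, long_total / short_total)  # 3.0 3.0: resolution traded
    print(round(block_geometry(36, 44_100)[0], 2))       # 26.12 ms at 44.1 kHz
```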

Figure 4.36 shows the available window types. In addition to the long and short symmetrical windows there is a pair of transition windows, known as start and stop windows, which allow a smooth transition between the two window sizes. In order to use critical sampling, MDCTs must resolve into a set of frequencies which is a multiple of four. Switching between 576 and 192 frequencies allows this criterion to be met. Note that an 8 ms window is still too long to eliminate pre-echo on its own. Pre-echo is eliminated using buffering, and the use of a short window minimizes the size of the buffer needed.
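
The window shapes can be written down and checked numerically. The sketch below uses sine-shaped windows as commonly given for ISO/IEC 11172-3 (a 36-point long window, a 12-point short window, and start and stop windows assembled from a long half, a flat section, a short half and a zero section); treat the exact definitions as assumptions to verify against the standard. The hypothetical helper `overlap_ok` tests the conditions under which the MDCT aliasing of two overlapping blocks cancels, confirming that a start window may follow a long window and a long window may follow a stop window, while a long window may not follow a start window directly.

```python
import math

# Window definitions as commonly given for ISO/IEC 11172-3 Layer III
# (assumed; verify against the standard). Index i runs over window samples.

def long_window():
    """Normal long window: 36-point sine window."""
    return [math.sin(math.pi / 36 * (i + 0.5)) for i in range(36)]

def short_window():
    """Short window: 12-point sine window."""
    return [math.sin(math.pi / 12 * (i + 0.5)) for i in range(12)]

def start_window():
    """Long-to-short transition: long rise, flat top, short fall, then zeros."""
    w = [math.sin(math.pi / 36 * (i + 0.5)) for i in range(18)]
    w += [1.0] * 6
    w += [math.sin(math.pi / 12 * (i - 18 + 0.5)) for i in range(24, 30)]
    w += [0.0] * 6
    return w

def stop_window():
    """Short-to-long transition: zeros, short rise, flat top, long fall."""
    w = [0.0] * 6
    w += [math.sin(math.pi / 12 * (i - 6 + 0.5)) for i in range(6, 12)]
    w += [1.0] * 6
    w += [math.sin(math.pi / 36 * (i + 0.5)) for i in range(18, 36)]
    return w

def overlap_ok(prev, nxt, tol=1e-12):
    """True if the second half of `prev` and the first half of `nxt` let the
    MDCT aliasing cancel: squared windows sum to one, and the products of
    each window with its own time-reversed half match."""
    n = len(prev) // 2
    for j in range(n):
        if abs(prev[n + j] ** 2 + nxt[j] ** 2 - 1.0) > tol:
            return False
        if abs(prev[n + j] * prev[2 * n - 1 - j] - nxt[j] * nxt[n - 1 - j]) > tol:
            return False
    return True

if __name__ == "__main__":
    long_w, start_w, stop_w = long_window(), start_window(), stop_window()
    print(overlap_ok(long_w, start_w), overlap_ok(stop_w, long_w))  # True True
    print(overlap_ok(start_w, long_w))                              # False
```

Only the long-window halves of each overlap are modelled; the placement of the short windows inside a short-block period is left out.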

Figure 4.36: The window functions of Layer III coding. At (a) is the normal long window, whereas (b) shows the short window used to handle transients. Switching between window sizes requires transition windows (c) and (d). An example of switching using transition windows is shown in (e).
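
The rule that every change of window size passes through a transition window can be expressed as a small mapping from per-period transient flags to window types. In the sketch below the flags are simply given as input (in a real coder they would come from the psychoacoustic model), the function name `block_types` is hypothetical, and a long period squeezed between two short ones is made short because it has no room for both a stop and a start window; that tie-break is an assumption, not taken from the text.

```python
def block_types(transient):
    """Map per-period transient flags to window types so that every change
    between long and short windows goes through a start or stop window."""
    types = []
    for g, is_short in enumerate(transient):
        prev_short = g > 0 and transient[g - 1]
        next_short = g + 1 < len(transient) and transient[g + 1]
        if is_short or (prev_short and next_short):
            types.append("short")   # no room for both a stop and a start window
        elif next_short:
            types.append("start")   # long-to-short transition
        elif prev_short:
            types.append("stop")    # short-to-long transition
        else:
            types.append("long")
    return types

if __name__ == "__main__":
    print(block_types([False, False, True, False, False]))
    # ['long', 'start', 'short', 'stop', 'long']
```

A single transient produces the classic long, start, short, stop, long sequence; state carried over from earlier audio (for example, a transient in the very first period) is ignored here.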

Layer III provides a suggested (but not compulsory) psychoacoustic model which is more complex than that suggested for Layers I and II, primarily because of the need for window switching. Pre-echo is associated with the entropy in the audio rising above the average value, and this can be used to switch the window size. The perceptual model is used to take advantage of the fine frequency resolution available from the MDCT, which allows the noise floor to be shaped much more accurately than with the 32 sub-bands of Layers I and II. Although the MDCT has fine frequency resolution, it does not carry the phase of the waveform in an identifiable form and so is not useful for discriminating between tonal and non-tonal (noise-like) inputs. As a result, a side FFT, which gives conventional amplitude and phase data, is still required to drive the masking model.
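
One way amplitude and phase data can separate tonal from noise-like components is to ask how well each spectral bin can be predicted from earlier frames: a steady tone advances in phase at a constant rate, noise does not. ISO/IEC 11172-3 psychoacoustic model 2 is described as using an "unpredictability" measure of this general kind; the sketch below is written from that idea only, with a plain DFT, rectangular frames and hypothetical function names (`dft_bin`, `unpredictability`, `frames_of`), and omits the model's windowing, spreading and thresholds.

```python
import cmath
import math
import random

def dft_bin(frame, k):
    """Complex DFT coefficient of one bin: amplitude and phase together."""
    n = len(frame)
    return sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))

def unpredictability(frames, k):
    """0 when bin k of the newest frame is exactly predicted from the previous
    two by linear extrapolation of magnitude and phase (tonal); larger, up to
    1, when it is not (noise-like)."""
    x0, x1, x2 = (dft_bin(f, k) for f in frames[-3:])
    r_pred = 2 * abs(x1) - abs(x0)
    phi_pred = 2 * cmath.phase(x1) - cmath.phase(x0)
    predicted = r_pred * cmath.exp(1j * phi_pred)
    denom = abs(x2) + abs(r_pred)
    return abs(x2 - predicted) / denom if denom else 0.0

def frames_of(signal, size=64, hop=70, count=3):
    """Hypothetical framing: `count` rectangular frames of `size` samples."""
    return [signal[t * hop:t * hop + size] for t in range(count)]

if __name__ == "__main__":
    n_samples = 2 * 70 + 64
    tone = [math.cos(2 * math.pi * 8 * i / 64) for i in range(n_samples)]
    rng = random.Random(1)
    noise = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
    print(unpredictability(frames_of(tone), 8))   # close to zero: tonal
    print(unpredictability(frames_of(noise), 8))  # typically much larger
```

The MDCT coefficients of the same tone would fluctuate from block to block, which is why the text says they cannot drive this kind of decision.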

Non-uniform quantizing is used, in which the quantizing step size becomes larger as the magnitude of the coefficient increases. The quantized coefficients are then subjected to Huffman coding, a technique in which the most common code values are allocated the shortest wordlengths. Layer III also has a certain amount of buffer memory so that pre-echo can be avoided during entropy peaks despite a constant output bit rate.
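
Both steps of this paragraph can be sketched briefly. The quantizer below raises the coefficient magnitude to the power 3/4 before rounding, so reconstruction levels grow as q^(4/3) and the effective step widens with amplitude; the 3/4 exponent, the 2^(1/4) step granularity and the 0.0946 rounding bias are taken from the Layer III reference encoder as I recall it and should be treated as assumptions. For the entropy coding, Layer III itself selects among fixed Huffman tables defined in the standard; the helper `huffman_code_lengths` (a hypothetical name) instead builds a code from observed counts, purely to show that the most common value receives the shortest word.

```python
import heapq
import math
from collections import Counter
from itertools import count

def quantize(x, step_exp):
    """Power-law quantizer: nint(v - 0.0946) with v = (|x| / step)^(3/4).
    Exponent and bias are assumed values from the reference encoder."""
    v = (abs(x) / 2.0 ** (step_exp / 4)) ** 0.75
    q = int(v + 0.5 - 0.0946)  # v >= 0, so int() acts as floor
    return -q if x < 0 else q

def dequantize(q, step_exp):
    """Inverse rule: |q|^(4/3) times the step, sign restored."""
    return math.copysign(abs(q) ** (4 / 3), q) * 2.0 ** (step_exp / 4)

def huffman_code_lengths(symbols):
    """Code length per symbol from a Huffman tree built on observed counts."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    tiebreak = count()  # unique second key, so dicts are never compared
    heap = [(n, next(tiebreak), {s: 0}) for s, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, a = heapq.heappop(heap)
        n2, _, b = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**a, **b}.items()}
        heapq.heappush(heap, (n1 + n2, next(tiebreak), merged))
    return heap[0][2]

if __name__ == "__main__":
    levels = [dequantize(q, 0) for q in range(6)]
    print([round(b - a, 2) for a, b in zip(levels, levels[1:])])  # gaps widen
    coeffs = [0.3, -0.2, 1.1, 0.05, -1.4, 4.0, 0.0, -0.4, 2.2, 0.1]
    q = [quantize(c, 0) for c in coeffs]
    print(q, huffman_code_lengths(q))  # zero, the commonest value, gets 1 bit
```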

Figure 4.37 shows a Layer III encoder. The output from the sub-band filter is 32 continuous band-limited sample streams. These are subjected to 32 parallel MDCTs. The window size can be switched individually in each sub-band as required by the characteristics of the input audio. The parallel FFT drives the masking model, which decides on window sizes as well as producing the masking threshold for the coefficient quantizer. The distortion control loop iterates until the available output data capacity is reached with the most uniform NMR (noise-to-mask ratio). The available output capacity can vary owing to the presence of the buffer.
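
The interaction between the iteration and the buffer can be sketched as follows. This covers only the rate-fitting part of the loop: the step size is coarsened until a frame fits the bits available, and bits left over by easy frames are banked in a buffer (often called the bit reservoir) and lent to difficult frames, while the channel still sees a constant rate. Balancing NMR across bands is not modelled. The cost model in `estimate_bits` is a hypothetical stand-in for the real Huffman tables, all function names are hypothetical, and the figures in the usage note (3072 bits per frame, roughly 128 kbit/s at 48 kHz ignoring header and side information, and a 4088-bit, that is 511-byte, buffer limit) are illustrative assumptions.

```python
import math

def quantize_mag(x, step_exp):
    # Same assumed 3/4-power rule as a Layer III style quantizer, magnitude only
    return int((abs(x) / 2.0 ** (step_exp / 4)) ** 0.75 + 0.4054)

def estimate_bits(coeffs, step_exp):
    """Hypothetical cost model in place of the Huffman tables:
    1 bit for a zero, otherwise a sign bit plus 2 bits per magnitude bit."""
    total = 0
    for x in coeffs:
        q = quantize_mag(x, step_exp)
        total += 1 if q == 0 else 1 + 2 * q.bit_length()
    return total

def rate_loop(coeffs, budget, max_step=255):
    """Coarsen the global step until the frame fits the bits available."""
    for step_exp in range(max_step + 1):
        if estimate_bits(coeffs, step_exp) <= budget:
            return step_exp
    raise ValueError("budget smaller than the cost of an all-zero frame")

def encode(frames, bits_per_frame, max_reservoir):
    """Constant output rate; unused bits are banked and lent to later frames.
    Returns (step, bits used, bits available) for each frame."""
    reservoir = 0
    log = []
    for coeffs in frames:
        available = bits_per_frame + reservoir
        step = rate_loop(coeffs, available)
        used = estimate_bits(coeffs, step)
        log.append((step, used, available))
        reservoir = min(available - used, max_reservoir)
    return log

if __name__ == "__main__":
    quiet = [0.0] * 576
    loud = [1000 * math.sin(i) for i in range(576)]
    for step, used, available in encode([quiet, loud], 3072, 4088):
        print(step, used, available)  # the loud frame sees more than 3072 bits
```

Because the cost never rises as the step grows, the first step that fits is also the finest one the budget allows.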
