Waveform coders. ADPCM ITU-T G.726 (VoIP Protocols)

2.5
Waveform coders are also called temporal speech coders; they rely on a time-domain, sample-by-sample approach. Such coders exploit the correlation between consecutive speech samples and are based on adaptive quantizers and adaptive (generally backward) predictors. They are very efficient in the range 24-40 kbit/s, but quality degrades quickly at lower rates (around 16 kbit/s).
Figure 2.37 Principle of Huffman encoding.
The most widely used standardized waveform coder (excluding ITU-T G.711) is the ADPCM ITU-T G.726 [A13] speech coder, which operates at 16, 24, 32, or 40 kbit/s. The 32-kbit/s version is used in DECT (digital enhanced cordless telecommunication) wireless phones in Europe, in PHS (personal handy-phone system) phones in Japan, and in DCME (digital circuit multiplication equipment) devices on submarine cables.
ADPCM stands for adaptive differential pulse code modulation; the name itself explains the basic principle of the G.726 speech coder (see Figure 2.38).
The adaptive quantizer is a one-word memory type (or Jayant type) as described in Section 2.2. The adaptive predictor is a mixed structure with six zeros and two poles; it processes the reconstructed signal using a two-coefficient adaptive filter (the poles) and the decoded difference signal using a six-coefficient adaptive filter (the zeros).
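The ADPCM principle can be sketched in a few lines of Python. This is a toy codec written for illustration, not G.726: it assumes 4-bit codes, a fixed first-order predictor instead of the adaptive six-zero, two-pole structure, and a step-size multiplier table (`STEP_MULT`) whose values are invented here. What it does share with the real coder is the key idea that the encoder runs a local copy of the decoder, and that the quantizer adapts using only the transmitted codes.

```python
import math

# Toy ADPCM codec: NOT bit-compatible with G.726.
# Assumptions: 4-bit codes (1 sign bit + 3 magnitude bits), a fixed
# first-order predictor, and illustrative Jayant-style multipliers.
STEP_MULT = [0.9, 0.9, 0.9, 0.9, 1.2, 1.6, 2.0, 2.4]  # indexed by code magnitude
MIN_STEP, MAX_STEP = 1.0, 2048.0
PRED_COEF = 0.9  # fixed predictor coefficient (G.726 adapts its coefficients)


class AdpcmState:
    def __init__(self):
        self.step = 16.0   # current quantizer step size
        self.recon = 0.0   # last reconstructed sample


def decode_step(state, code):
    """Inverse-quantize one code and update the state (used by both sides)."""
    sign = -1.0 if code & 0x8 else 1.0
    mag = code & 0x7
    diff = sign * (mag + 0.5) * state.step        # inverse quantizer
    state.recon = PRED_COEF * state.recon + diff  # prediction + difference
    # Backward adaptation: depends only on the transmitted code.
    state.step = min(MAX_STEP, max(MIN_STEP, state.step * STEP_MULT[mag]))
    return state.recon


def encode(samples):
    state, codes = AdpcmState(), []
    for x in samples:
        diff = x - PRED_COEF * state.recon        # prediction error
        mag = min(7, int(abs(diff) / state.step))
        code = (0x8 if diff < 0 else 0) | mag
        codes.append(code)
        decode_step(state, code)                  # local decoder stays in sync
    return codes


def decode(codes):
    state = AdpcmState()
    return [decode_step(state, c) for c in codes]


def tone(count, amplitude=1000.0, freq_hz=200.0, rate_hz=8000.0):
    """Test signal: a sine wave sampled at rate_hz."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / rate_hz)
            for n in range(count)]


def rms_error(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))
```

Calling `decode(encode(tone(400)))` returns a reconstruction of the tone from 4 bits per sample. Because the step size is derived from the codes alone, no side information has to be transmitted.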
The basic scheme (Figure 2.38) does not include some useful features, such as a dynamic switch that selects an alternative strategy when voice band modem signals are detected, allowing the ADPCM coder to adapt to modem signals. One of the major drawbacks of coding schemes that reduce the bitrate by relying on speech characteristics is that they fail for non-speech signals: voice band modem signals are completely synthetic and do not fit the prediction and adaptation procedures tailored for speech signals. The dynamic strategy switch allows transmission of a 9,600-bit/s modem signal for 32 kbit/s ADPCM and a 14,400-bit/s signal (V.33) for 40 kbit/s ADPCM.
Figure 2.38 ITU-T G.726 ADPCM (16, 24, 32, or 40 kbit/s) basic scheme. The distant decoder is equivalent to the local decoder inside the dotted box.
G.726 and its predecessor G.721, standardized in 1984, were the first bit reduction schemes used for civilian telecommunications. G.726 is still one of the most widely used coders over terrestrial and submarine cables, in combination with speech interpolation. Speech interpolation relies on the statistical distribution of speech activity over a large number of tributary speech links. In a conventional conversation, each speaker is active less than 50% of the time on each side of the transmission link; the corresponding bandwidth can be used to transmit another voice channel. This becomes even easier with VoIP by using discontinuous speech transmission. Using speech interpolation and ADPCM (G.726 ADPCM at 32 kbit/s), a DCME can achieve a compression gain of 4 to 5.
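The compression gain of 4 to 5 can be checked with simple arithmetic. The sketch below assumes the gain is the product of the coding gain (64 kbit/s PCM reduced to 32 kbit/s ADPCM) and the interpolation gain (the inverse of the speech activity factor). The activity factors 0.5 and 0.4 are illustrative values consistent with "less than 50%", not measured figures.

```python
PCM_RATE_KBPS = 64.0    # G.711 reference channel
ADPCM_RATE_KBPS = 32.0  # G.726 at 32 kbit/s


def dcme_gain(activity_factor):
    """Compression gain = coding gain x interpolation gain (idealized)."""
    coding_gain = PCM_RATE_KBPS / ADPCM_RATE_KBPS  # 2.0
    interpolation_gain = 1.0 / activity_factor     # channels shared per link
    return coding_gain * interpolation_gain


gain_at_50_percent = dcme_gain(0.5)  # speakers active half of the time
gain_at_40_percent = dcme_gain(0.4)  # assumed lower activity factor
```

With these assumptions the gain is 4 at 50% activity and 5 at 40% activity, which brackets the range quoted for DCME.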
Because the ADPCM encoder and decoder are nearly symmetrical (they differ only in their quantizer procedures), both require similar processing power, approximately 5 MIPS (16-bit fixed point). Despite this low complexity, the speech quality of G.726 is very good (above 24 kbit/s), as indicated in Figure 2.39.
One interesting feature of the ADPCM coder is its relative immunity to bit errors compared with PCM. As shown in Figure 2.40, there is a significant difference in favor of the ADPCM coder for a BER (bit error rate) of 10⁻³. There are two main reasons: PCM is very sensitive to an error on the sign bit, and ADPCM combines the state variables of the algorithm and consequently becomes more robust. This difference disappears in VoIP, where errors do not occur as isolated bit errors but result in the loss of complete frames (a packet is rejected if its checksum is wrong).
Figure 2.39 Typical MOS scores of common voice coders.
Figure 2.40 Comparison between the BER sensitivity of ADPCM and that of PCM.
Although ADPCM coders are not based on frame-by-frame analysis and coding, in some circumstances (e.g., for voice over IP) ADPCM codes may be transmitted in packet form. One packet assembles several codes (typically 10-30 ms of speech), each corresponding to a single sample. In the case of packet loss or ‘frame’ errors, the situation with PCM or ADPCM can be disastrous compared with hybrid or ABS (analysis by synthesis) speech coders, which can rely on the last valid received parameters (such as LPC and LTP coefficients) to rebuild an approximation of the signal for the lost frame. For ADPCM, the loss of many code words breaks the tracking of the local decoder by the distant decoder, and a long time (250-500 ms) is needed to recover a stable state.
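To give an idea of what a packet of ADPCM codes looks like, the following sketch computes the number of codes and payload bytes for a given bitrate and packet duration. It assumes 8-kHz sampling, one code per sample, and codes packed back to back with no padding; protocol headers are ignored. The function name `payload_size` is hypothetical.

```python
SAMPLE_RATE_HZ = 8000  # narrow-band telephony sampling rate


def payload_size(bitrate_kbps, packet_ms):
    """Return (codes per packet, payload bytes) for sample-by-sample ADPCM."""
    bits_per_code = bitrate_kbps * 1000 // SAMPLE_RATE_HZ  # 2, 3, 4, or 5 bits
    codes = SAMPLE_RATE_HZ * packet_ms // 1000             # one code per sample
    return codes, codes * bits_per_code // 8


codes_20ms, bytes_20ms = payload_size(32, 20)  # 160 codes, 80 bytes
codes_30ms, bytes_30ms = payload_size(40, 30)  # 240 codes, 150 bytes
```

Losing one 20-ms packet at 32 kbit/s therefore removes 160 consecutive codes, each of which would have updated the decoder state.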
2.5.1


Coder specification … from digital test sequences to C code

G.726 (or more exactly its predecessor G.721) was the first speech coder whose specification included an exhaustive set of digital test vectors. Such vectors are required to ensure interoperability between equipment built by different manufacturers.
The set of test vectors was required because G.726 did not include a C reference, only extensive documentation of a fixed point implementation. A fixed point implementation is a strong requirement for economical implementations in DSPs (digital signal processors) or in dedicated VLSI. The ITU-T recommendation includes the exact (fixed point) format of the variables, constants, state variables, and tables used in the algorithm. It also describes most of the operations required by the algorithm, such as addition, subtraction, fixed point multiplication, and control of possible saturation (which may happen frequently in fixed point arithmetic).
The lack of a reference code was a problem, and ITU-T later introduced reference fixed point ANSI C code for new coders, in which all mathematical operations (add, multiply, etc.) are fully specified (this reference implementation is called basic op, for ‘basic operation’). Today, an ANSI C reference code is the main part of the recommendation for many speech coders, such as ITU-T G.723.1 or G.729. Test vectors are also provided to facilitate the verification of compliance with the standard. These test vectors are designed to provide extensive coverage of the algorithms used in the implementation for both coding and decoding.
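The kind of fully specified operation described above can be illustrated with a short sketch. The Python functions below model 16-bit saturating addition, subtraction, and Q15 fractional multiplication in the spirit of the basic operations; they are an illustration under the assumption of two's-complement 16-bit words and are not the ITU-T reference code.

```python
MAX_16 = 32767    # largest 16-bit signed value
MIN_16 = -32768   # smallest 16-bit signed value


def saturate(value):
    """Clamp a wider intermediate result to the 16-bit signed range."""
    return max(MIN_16, min(MAX_16, value))


def add(a, b):
    """16-bit addition with saturation instead of wrap-around."""
    return saturate(a + b)


def sub(a, b):
    """16-bit subtraction with saturation instead of wrap-around."""
    return saturate(a - b)


def mult(a, b):
    """Q15 x Q15 -> Q15 fractional multiply: keep the upper bits, saturate."""
    return saturate((a * b) >> 15)


overflow_case = add(30000, 10000)     # clamps to 32767 instead of wrapping
quarter = mult(16384, 16384)          # 0.5 x 0.5 = 0.25, i.e. 8192 in Q15
corner_case = mult(-32768, -32768)    # (-1) x (-1) would be +1: clamps to 32767
```

Specifying every operation at this level is what allows two independent implementations to produce bit-identical output on the test vectors.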
Floating point versions of some algorithms are also useful to improve the quality of implementations in PCs and workstations, and eliminate interoperability issues between fixed point and floating point implementations (e.g., a VoIP gateway using fixed point DSPs and a client PC software using native floating point arithmetic for efficiency). Specific test vectors also help verify the interoperability between different floating point implementations, due to the variety of floating point number representations.
2.5.2

Embedded version of the G.726 ADPCM coder: G.727

One desirable feature of a coder is the ability to dynamically adjust coder properties to the instantaneous conditions of transmission channels. This requires some synchronization between the encoder and the decoder when the encoding properties change.
ADPCM can dynamically switch among multiple encoding rates. In this case, embedded means that a core quantizer is used for the fundamental operations of the coder, and additional quantization bits are allocated to an ‘enhancement’ quantizer. The scale used by the core quantizer is subdivided to form the scale of the enhancement quantizer. To ensure that synchronization is not lost even if some ‘enhancement’ bits are changed or not transmitted at all, the decoder synchronization state is based only on the bits from the ‘core’ quantizer. This makes it possible to steal or remove some bits in the transmitted code words without desynchronizing the distant decoder, allowing a ‘graceful’ degradation of the decoded speech without requiring any external signaling. This feature is very useful in applications such as DCME or PCME (packet circuit multiplication equipment), in overload situations (too many active channels present at the same time), or for ‘in band’ signaling or ‘in band’ data transmission.
This concept is used in the embedded version of G.726 (ITU-T recommendation G.727 [A1]). To ensure that the distant decoder tracks the local decoder correctly, and because this distant decoder may receive code words with robbed bits, the inner prediction loop relies on the inverse core version of the quantizer:
• On the encoder side, the difference signal is encoded with the full number of steps of the enhanced quantizer, but bits in excess in the enhanced version are masked before feeding the inverse core quantizer.
• On the decoder side, the excess bits of the received code word are masked in order to feed the core inverse quantizer which is used in the prediction and reconstruction inner loop, but the entire received code word enters the enhanced adaptive quantizer, whose output is used to build the final output.
If there are no robbed bits, the output quality is enhanced, but is not as good as if the enhanced version of the quantizer had been used in the inner loop of the encoder and decoder, using all available quantization bits: that is the price to pay for the ‘embedded’ feature.
Figures 2.41 and 2.42 illustrate the G.727 concept.
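The bit-masking idea behind the embedded scheme can be sketched as follows. The code assumes that the enhancement bits occupy the least significant positions of each code word, so that robbing them leaves the core bits intact; the helper names `core_part` and `rob_bits` are hypothetical, and no actual quantizer table from G.727 is reproduced.

```python
def core_part(codeword, total_bits, core_bits):
    """Keep the core (most significant) bits; drop the enhancement bits."""
    return codeword >> (total_bits - core_bits)


def rob_bits(codeword, total_bits, robbed):
    """Simulate a network element zeroing the `robbed` least significant bits."""
    keep_mask = ((1 << total_bits) - 1) & ~((1 << robbed) - 1)
    return codeword & keep_mask


# Example with an (E, C) = (5, 2) pair: 5-bit code words, 2-bit core.
sent = 0b10110
received = rob_bits(sent, total_bits=5, robbed=2)           # 0b10100
core_at_encoder = core_part(sent, total_bits=5, core_bits=2)
core_at_decoder = core_part(received, total_bits=5, core_bits=2)
```

Both sides feed the same core value (here `0b10`) into their prediction loops, so the decoder state stays aligned with the encoder even though two enhancement bits were lost in transit.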
2.5.3

Wide-band speech coding using a waveform-type coder

2.5.3.1

G.722

In the world of telephony, G.711 is frequently used as ‘the’ reference of voice quality, ignoring the fact that G.711 encodes only the 300-3,400-Hz band. The truth is that it is very difficult to go beyond G.711 quality in traditional telephone networks, because most of the components, from switches to transmission links, assume a G.711 signal (with the exception of transparent ISDN, which is available in some countries).
This is no longer true with voice over IP, where virtually any encoding scheme can be used end to end on the IP network. There are strong requirements to offer better speech and audio quality for videoconference and audioconference systems [A4, A14]. While most coders focus on providing an acceptable voice quality for the lowest possible bitrate, it is also possible to increase the audio quality as much as possible for a given bitrate.
Figure 2.41 G.727 encoder. ITU-T G.727 embedded ADPCM (16, 24, 32, or 40 kbit/s) basic scheme. G.727 is characterized by the enhancement and core (E, C) pair values for the quantizers. C can take the values 2, 3, or 4, and E the values 2, 3, 4, or 5.
Figure 2.42 G.727 decoder. ITU-T G.727 embedded ADPCM (16, 24, 32, or 40 kbit/s) decoder basic scheme.
Scientists and engineers were well aware that waveform ADPCM speech coders can reduce the bitrate by a factor of about 2 (to about half the PCM rate) and naturally tried to use a similar technique to encode wide-band speech. Wide band refers to a transmitted frequency band of 50 Hz up to 7,000 Hz, compared with the traditional telephony bandwidth (300 Hz to 3,400 Hz).
G.722 was proposed by France Telecom and NTT, and adopted by ITU in 1988. The fundamental idea is to split the band to be transmitted into two subbands: a lower subband spanning from 0 Hz to 4,000 Hz and a higher subband spanning from 4,000 Hz to 8,000 Hz. Then, after a subsampling procedure that reduces the sampling frequency from the original 16 kHz down to 8 kHz, two ‘classical’ ADPCM encoders can be applied to reduce the bitrate. Subsampling is possible because subband frequency filtering has eliminated the aliasing effect.
Subband separation uses a pair of quadrature mirror filters (QMF). QMF filters are the precursors of the filter bank theory used for psychoacoustic coders. In many ways the wide-band ITU-T G.722 speech and audio coder is a precursor of the more recent psychoacoustic audio coders: the original band is split into two subbands, and more bits are allocated to the lower subband, so that prediction is more efficient and quantization noise is kept lowest in the most sensitive frequency band. The energy of speech signals is more concentrated in the lower subband, and allocating more bits to this subband increases the quality of decoded speech.
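The two-band split with subsampling can be demonstrated with the shortest possible mirror filter pair. The sketch below uses the 2-tap Haar pair as a stand-in for the QMF: it is an assumption made for brevity (G.722 uses much longer filters with far better band separation), but it shows the same structure, that is, low-pass and high-pass branches each running at half the input rate, followed by exact reconstruction.

```python
import math

SCALE = 1.0 / math.sqrt(2.0)


def analysis(x):
    """Split an even-length signal into low and high subbands at half rate."""
    low = [SCALE * (x[i] + x[i + 1]) for i in range(0, len(x), 2)]
    high = [SCALE * (x[i] - x[i + 1]) for i in range(0, len(x), 2)]
    return low, high


def synthesis(low, high):
    """Recombine the two half-rate subbands into a full-rate signal."""
    out = []
    for lo, hi in zip(low, high):
        out.append(SCALE * (lo + hi))
        out.append(SCALE * (lo - hi))
    return out


signal = [3.0, 1.0, -2.0, 4.0, 0.5, 0.5, 7.0, -1.0]
low_band, high_band = analysis(signal)
rebuilt = synthesis(low_band, high_band)
```

Each subband carries half as many samples as the input, so the total sample count is unchanged, and `rebuilt` matches `signal` up to floating point rounding. In a coder, each subband would be passed through its own ADPCM encoder before transmission.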
G.722 encodes a wide-band signal into a bitstream of 64 kbit/s (the basic PCM bitrate). In the lower subband, 6 bits are used for the adaptive quantizer, which has an embedded characteristic: the core quantizer uses 4 bits and the enhanced version uses 6 bits. This scheme is very similar to the one found in the embedded version of ADPCM (G.727). It allows the system to steal some bits for signaling purposes (framing with H.221) and to transmit some ancillary data. The mode of operation (64, 56, or 48 kbit/s) should be signaled to the decoder, although some realizations do not signal the mode and permanently use the full 6 bits. In the higher subband, a 2-bit (nonembedded) adaptive quantizer is used, producing a 16-kbit/s bitrate (much lower than the 48 kbit/s used for the lower subband, which is perceptually more important).
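The three modes follow directly from the bit allocation. The short calculation below assumes each subband is sampled at 8 kHz after the split, with 2 bits per sample in the higher subband and 6, 5, or 4 bits per sample in the lower subband depending on how many bits are stolen; the function name `g722_rates` is hypothetical.

```python
SUBBAND_RATE_HZ = 8000  # each subband is sampled at 8 kHz after the split
HIGH_BAND_BITS = 2      # nonembedded quantizer in the higher subband


def g722_rates(low_band_bits):
    """Return (lower, higher, total) bitrates in kbit/s."""
    low = SUBBAND_RATE_HZ * low_band_bits // 1000
    high = SUBBAND_RATE_HZ * HIGH_BAND_BITS // 1000
    return low, high, low + high


mode_64 = g722_rates(6)  # (48, 16, 64): full 6-bit lower subband
mode_56 = g722_rates(5)  # (40, 16, 56): one bit per sample stolen
mode_48 = g722_rates(4)  # (32, 16, 48): only the 4-bit core remains
```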
The coding scheme of G.722 is illustrated in Figure 2.43, and the decoding principle of G.722 is shown in Figure 2.44.
The ITU-T G.722 wide-band speech coder is commonly used in teleconference systems adhering to the H.320 recommendation. The quality is quite good for speech and music at 64 kbit/s and 56 kbit/s (MOS of 4.3 and 4 compared with an original with the same bandwidth rated at 4.3). As there is no specific ‘production model’ (e.g., for speech) in that waveform coder, samples of music are correctly encoded. When used at 48 kbit/s, reproduced speech becomes more noisy (due to the 4-bit quantizer in the lower subband).
Figure 2.43 G.722 encoder. ITU-T G.722 wide-band encoder, subband ADPCM with QMF filter (48-kbit/s embedded ADPCM in lower subband and 16-kbit/s ADPCM in higher subband).
Figure 2.44 G.722 decoder. ITU-T G.722 wide-band decoder, subband ADPCM with QMF filter (48-kbit/s embedded ADPCM in lower subband and 16-kbit/s ADPCM in higher subband).
G.722 shares with other waveform ADPCM coder types a relative immunity to bit errors and is more robust than a direct PCM stream. The low-delay characteristic of G.722 is also a major advantage compared with more recent frame-based audio coding schemes. All the waveform coders, such as ADPCM and PCM, have a very low algorithmic delay ranging from three to four samples (300-500 µs with an 8-kHz sampling frequency). In the case of G.722, QMF analysis and synthesis filters add a delay of about 3 ms. The resulting total delay remains excellent and ensures good interactivity for teleconference systems.
G.722 is one of the coders recommended for use in H.323 systems and is available in several commercial implementations.
2.5.3.2

G.722.1

One of the limitations of G.722 is that it cannot be used below 48 kbit/s. The more recent G.722.1 (September 1999) can encode a wide-band signal with a bitrate of 24 kbit/s or 32 kbit/s (a proprietary PictureTel version, called Siren™, exists at 16 kbit/s).
G.722.1 works on frames of 40 ms (640 samples at a 16-kHz sampling rate) with an overlap of 20 ms. It multiplies each 40-ms frame by a sine window (so that the amplitude of the signal converges to 0 at both ends of the frame), then performs a discrete cosine transform (DCT). The whole operation is called the modulated lapped transform (MLT); it is illustrated in Figure 2.45.
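The windowing and overlap described above can be sketched with the textbook form of the lapped transform (the MDCT with a sine window). This is an illustration, not the bit-exact G.722.1 transform: the normalization and the tiny block size used in the example (8 coefficients per frame instead of 320) are assumptions chosen so that the pure-Python code stays short.

```python
import math


def sine_window(length):
    """Sine window that tapers to (almost) zero at both ends of the frame."""
    return [math.sin(math.pi * (t + 0.5) / length) for t in range(length)]


def mdct(frame):
    """Transform 2N windowed samples into N coefficients."""
    n = len(frame) // 2
    return [sum(frame[t] * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
                for t in range(2 * n))
            for k in range(n)]


def imdct(coefs):
    """Transform N coefficients back into 2N (time-aliased) samples."""
    n = len(coefs)
    return [(2.0 / n) * sum(coefs[k] * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
                            for k in range(n))
            for t in range(2 * n)]


def lapped_roundtrip(signal, n):
    """Analyse overlapping frames (hop n, length 2n) and overlap-add the result."""
    window = sine_window(2 * n)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * n + 1, n):
        frame = [signal[start + t] * window[t] for t in range(2 * n)]
        rebuilt = imdct(mdct(frame))
        for t in range(2 * n):
            out[start + t] += rebuilt[t] * window[t]  # synthesis window + overlap-add
    return out


N = 8  # coefficients per frame in this toy example (320 in G.722.1)
samples = [math.sin(0.3 * t) + 0.5 * math.cos(1.1 * t) for t in range(6 * N)]
restored = lapped_roundtrip(samples, N)
```

Every sample covered by two overlapping frames (indices `N` to `5 * N - 1` here) is restored exactly, up to rounding, even though each frame yields only `N` coefficients for `2 * N` input samples. That is why a 40-ms window advancing by 20 ms produces one 20-ms frame of coefficients.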
The result is the encoding of a 20-ms frame using 480 bits at 24 kbit/s and 640 bits at 32 kbit/s. Each frame is encoded independently; there is no state at the receiver. This interesting property prevents frame de-synchronization in the case of frame erasures, which are typical of VoIP systems. The resulting spectrum is analysed in 16 regions, in order to determine which regions are more important to the listener (perception model). Each frequency region is then quantized and vector-encoded using Huffman encoding. The more important frequency regions (from a perception point of view) are allocated more bits than the less important ones.
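Huffman encoding assigns shorter code words to more frequent values. The sketch below builds a code from a table of symbol frequencies using a priority queue; the symbols and frequencies are invented for the example, and the code tables actually used by G.722.1 are not reproduced here.

```python
import heapq


def huffman_code(freqs):
    """Return {symbol: bitstring} for a dict mapping symbols to frequencies."""
    if len(freqs) == 1:
        return {symbol: "0" for symbol in freqs}
    # Heap entries: (weight, unique tie-breaker, partial code table).
    heap = [(weight, index, {symbol: ""})
            for index, (symbol, weight) in enumerate(freqs.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, table1 = heapq.heappop(heap)  # two least frequent subtrees
        w2, _, table2 = heapq.heappop(heap)
        merged = {s: "0" + bits for s, bits in table1.items()}
        merged.update({s: "1" + bits for s, bits in table2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


# Hypothetical frequencies of six quantizer output values.
example_freqs = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
example_code = huffman_code(example_freqs)
total_bits = sum(example_freqs[s] * len(example_code[s]) for s in example_freqs)
```

For this table the most frequent symbol receives a 1-bit code word and the two rarest receive 4-bit code words, for a total of 224 bits instead of the 300 bits needed with a fixed 3-bit code.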
Figure 2.45 Modulated lapped transform used in G.722.1.
This coder uses about 14 MIPS (3% of a Pentium III-600) and is supported in the Windows XP® Messenger softphone in its proprietary 16-kbit/s version (Siren™).
