2.9 Conclusion on speech-coding techniques and their near future (VoIP Protocols)

2.9.1 The race for low-bitrate coders

Many coding schemes have not been described in this topic:
• The MELP (mixed excitation LPC) coder, retained in the new 2,400-bit/s US Federal standard.
• The VSELP (vector sum excited LP) coder, used in the half-rate 5.6-kbit/s GSM system.
• The multi-rate Q-CELP (Qualcomm CELP) at 1, 2, 4 and 8 kbit/s, used in the cellular US IS-96 CDMA system.
• Multi-band excitation (MBE) coders.
• Sinusoidal transform coders (STCs).
The number of coding schemes reflects the constant progress of speech-coding technology. This progress has been driven by major telecommunication applications.
The first application of voice coding was the optimization of submarine cables and expensive long-distance links. The focus was on reducing bitrate while preserving good voice quality, and on providing reasonable support for modem and fax transmission. This led to relatively simple voice coders like the ITU G.726 at 32 kbit/s (1990).
Since 1990 the bitrate required to reach toll quality has decreased to about 8 kbit/s, or one bit per sample!
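The "one bit per sample" figure can be checked with a short calculation. The sketch below assumes the usual narrowband telephony sampling rate of 8 kHz, which the text does not state explicitly; the function name is illustrative only.

```python
# Assumption: narrowband telephony speech is sampled at 8 kHz.
SAMPLE_RATE_HZ = 8000


def bits_per_sample(bitrate_bps, sample_rate_hz=SAMPLE_RATE_HZ):
    """Average number of coded bits spent on each input sample."""
    return bitrate_bps / sample_rate_hz


if __name__ == "__main__":
    # Bitrates quoted in the text: G.726 at 32 kbit/s, toll quality at 8 kbit/s.
    for name, bitrate_bps in [("G.726", 32000), ("8 kbit/s coder", 8000)]:
        print(f"{name}: {bits_per_sample(bitrate_bps):.1f} bit(s) per sample")
```

Under that assumption, G.726 spends 4 bits on each sample, while an 8-kbit/s coder spends a single bit.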
2.9.2 Optimization of source encoding and channel encoding

After 1999, the priority was no longer the absolute reduction of the bitrate, because the price of bandwidth continuously decreased on fixed lines. The driving application for voice-coding technology became wireless telephony. Wireless telephony offers a limited transmission bandwidth, which can be addressed by existing algorithms, but more importantly the transmission quality of the transport channel varies continuously. The best voice quality depends not only on how good the source encoding of the voice coder is but also, and just as importantly, on how well channel encoding can correct transmission errors.
The priority of coder research became the optimal combination of source-encoding and channel-encoding methods in a given envelope. Both compete for the available bitrate on the channel:
• If the number of errors is low, the channel-encoding algorithm is not necessary and does not generate any redundancy information, and the full available bitrate can be used for the voice coder (source encoding).
• If the number of errors is high, the channel-encoding algorithm will generate a lot of redundancy information to protect voice coder information. As a consequence, the source-encoding algorithm needs to reduce its bitrate.
Dynamic optimization of the source-encoding and channel-encoding allocation within the available bitrate is a complex problem. The AMR and AMR-WB coders are the result of research carried out on this problem: both use multiple source-encoding algorithms, each combined with a channel-encoding algorithm, and the optimal mode is switched dynamically as transmission conditions change.
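The trade-off described above can be sketched in a few lines. The example uses the eight AMR narrowband source rates and the 22.8-kbit/s gross bitrate of a GSM full-rate traffic channel. The selection rule itself, including the `channel_quality` score between 0 and 1 and its linear mapping to a mode, is a hypothetical simplification for illustration; real AMR link adaptation relies on channel measurements and signalling that are not modelled here.

```python
# Gross bitrate of a GSM full-rate traffic channel, in kbit/s.
GROSS_CHANNEL_KBPS = 22.8

# AMR narrowband source-coding rates in kbit/s, highest first.
AMR_MODES_KBPS = [12.2, 10.2, 7.95, 7.4, 6.7, 5.9, 5.15, 4.75]


def protection_kbps(source_kbps, gross_kbps=GROSS_CHANNEL_KBPS):
    """Bitrate left over for channel coding once the source coder is served."""
    return round(gross_kbps - source_kbps, 2)


def select_mode(channel_quality):
    """Pick a source rate from a hypothetical quality score in [0.0, 1.0].

    1.0 means an error-free channel (highest source rate), 0.0 the worst
    channel (lowest source rate, hence the most room for redundancy).
    """
    if not 0.0 <= channel_quality <= 1.0:
        raise ValueError("channel_quality must lie between 0.0 and 1.0")
    last = len(AMR_MODES_KBPS) - 1
    index = min(int((1.0 - channel_quality) * len(AMR_MODES_KBPS)), last)
    return AMR_MODES_KBPS[index]


if __name__ == "__main__":
    for quality in (1.0, 0.5, 0.0):
        source = select_mode(quality)
        print(f"quality {quality:.1f}: source {source} kbit/s, "
              f"protection {protection_kbps(source)} kbit/s")
```

As the quality score falls, the source rate shrinks and the share of the fixed gross bitrate available for redundancy grows, which is the behaviour the two bullet points describe.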
This new generation of voice coders provides a much more homogeneous experience over a radio channel of varying quality: voice quality does degrade as the radio conditions of the transmission channel get worse, but does so progressively, without the catastrophic degradation experienced with single-mode codecs. Dynamic selection of the optimal source coder and channel coder makes the best possible use of the transport link under any conditions.
To a large extent, the enhancements of voice coders that were originally designed for radio channels are also valid for VoIP. The only significant difference is that radio channels create bit errors in the data stream (characterized by a bit error rate, or BER), while IP networks create frame-level (packet-level) errors. For a given bitrate, VoIP can also benefit from an optimal combination of source encoding and channel encoding, but the optimal channel-encoding method for VoIP differs from the optimal channel-encoding method for wireless applications, as it must protect against frame erasures.
2.9.3 The future

2.9.3.1 VoIP

What should we expect next? Perhaps the most important feedback from early VoIP trials was that there was no market for subtoll-quality voice. Users are not merely unwilling to pay a reduced price for such voice quality; they are unwilling to pay at all. As a consequence there are no big incentives to continue to decrease the bitrate of a pure voice coder, and IP overheads would make such progress irrelevant anyway. Although there is still some progress for voice coders to make at 4 kbit/s and below, none of these coders achieves toll quality, and therefore they can only be used in degraded conditions, in combination with a high-redundancy channel-encoding method, or in military applications.
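The remark about IP overheads can be made concrete with a rough calculation. The sketch below assumes uncompressed IPv4, UDP and RTP headers (20 + 8 + 12 = 40 bytes per packet) and one 20-ms voice frame per packet; the packetization interval and the function name are assumptions made for illustration, and link-layer overhead is ignored.

```python
# Assumption: uncompressed IPv4 (20) + UDP (8) + RTP (12) headers per packet.
IP_UDP_RTP_HEADER_BYTES = 20 + 8 + 12


def ip_bitrate_kbps(codec_kbps, packet_ms=20):
    """Bitrate at the IP layer for a codec sending one packet every packet_ms."""
    payload_bytes = codec_kbps * packet_ms / 8  # kbit/s * ms = bits
    packet_bytes = payload_bytes + IP_UDP_RTP_HEADER_BYTES
    return packet_bytes * 8 / packet_ms  # bits per ms = kbit/s


if __name__ == "__main__":
    for codec_kbps in (8, 4):
        print(f"{codec_kbps} kbit/s codec -> "
              f"{ip_bitrate_kbps(codec_kbps):.1f} kbit/s on the IP network")
```

Under these assumptions an 8-kbit/s coder occupies 24 kbit/s at the IP layer, and halving the coder rate to 4 kbit/s only brings this down to 20 kbit/s, because the headers alone account for 16 kbit/s.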
One issue with current coders is their poor performance for the transport of music; another is the degradation of voice encoding when there are multiple speakers or background noise. It seems that most of the effort in the coming years will go into addressing these weaknesses, while keeping a bitrate of 8 kbit/s or even above.
Unlike wireless networks, which will always have a tight bandwidth constraint (shared medium), VoIP applications benefit from the constant progress of wired transmission links. As the cost of bandwidth decreases, it becomes more interesting to provide users with a better telephony experience. Some VoIP systems already support wide-band voice coders, such as G.722, which make it easier to recognize the speaker and provide a more natural sound. Beyond wide band, multichannel coders (stereo, 5.1) can provide spatialized sound, which can be useful for audio- or videoconferences. Since 2002, it seems the focus of voice encoding for VoIP systems has shifted from low-bitrate encoders to these high-quality wide-band encoders.
We believe that in the coming years wide-band encoders will become increasingly common in VoIP systems.
2.9.3.2 Broadcast systems

Both wireless telephony and VoIP are interactive, one-to-one systems, where there is only one transmission channel. Audio broadcast systems on the Internet pose a different problem. For such systems there are many transmission channels, each with different degradation levels. If a separate information stream is sent on each channel (multi-unicast), then dynamic mode selection works. For multicast systems, however, where everyone receives the same information, it would not be optimal to use the bit allocation between source encoding and channel encoding that suits the worst channel, as even listeners using the best transmission channels would experience poor audio quality.
For such links, the focus is on hierarchical coders, which produce several streams of information: a core stream, providing low quality, is transmitted with the highest possible redundancy (and an above-average QoS level if available), while one or more enhancement streams provide additional information on top of the core stream, allowing receivers to improve playback quality. ISO MPEG-4 is an example of a hierarchical encoder. These systems are mainly useful for broadcast or multicast systems, and their use for VoIP (e.g., with H.332) mainly depends on the deployment of multicast-capable IP networks.
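The layering idea can be illustrated with a small sketch in which each receiver takes the core stream plus as many enhancement streams as its own channel can carry. The layer bitrates and the function name are hypothetical and are not taken from MPEG-4 or any other standard; the point is only that the same multicast content serves receivers with different channel capacities.

```python
def usable_layers(available_kbps, core_kbps, enhancement_kbps):
    """Number of streams (core first) that fit within a receiver's capacity.

    Enhancement streams only make sense on top of the core stream and of
    every lower enhancement stream, so they are added strictly in order.
    Returns 0 when even the core stream does not fit.
    """
    if available_kbps < core_kbps:
        return 0
    used_kbps = core_kbps
    layers = 1
    for rate_kbps in enhancement_kbps:
        if used_kbps + rate_kbps > available_kbps:
            break
        used_kbps += rate_kbps
        layers += 1
    return layers


if __name__ == "__main__":
    # Hypothetical layer bitrates, chosen only for illustration.
    core, enhancements = 8, [8, 16]
    for capacity_kbps in (5, 10, 16, 40):
        count = usable_layers(capacity_kbps, core, enhancements)
        print(f"{capacity_kbps} kbit/s available -> {count} stream(s) decoded")
```

A receiver on a poor channel still gets the core stream, while a receiver on a good channel adds the enhancement streams, so the sender does not have to tune a single stream to the worst listener.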
