2.8 Quality of speech coders
Most speech coders have been designed to achieve the best possible level of speech reproduction quality within the constraints of a given source-encoding bitrate. For narrowband coders, the reference is ‘toll quality’, or the quality of speech encoded by the G.711 coder. For wide-band coders (transmitting the full 50–7,000-Hz band), the reference is the G.722 coder.
In fact, assessing the quality of a speech coder is a complex task which depends on multiple parameters:
• The absolute quality of the reproduced speech signal. This is the most commonly used figure, but it does not take interactivity into account (i.e., the delay introduced by the speech coder in a conversation). Several methods exist to assess the absolute, noninteractive speech quality of a coder. We will describe the MOS, which results from the ACR (absolute category rating) method, and the DMOS, obtained with the DCR (degradation category rating) method. Several environmental conditions may influence speech degradation and need to be taken into account, such as the speech input level and the type and level of background noise (babble noise, hall noise, etc.).
• The delay introduced by the coder algorithm (algorithmic delay). This delay is due to the size of the speech signal frames that are encoded and to the additional signal samples that the coder needs to accumulate before encoding the current frame (look-ahead); a short worked example follows this list. Obviously, delay is only relevant for interactive communications, not for voice storage or noninteractive streaming applications.
• The complexity of the coder, which will result in additional processing delay on a given processor.
• The behavior of the coder for music signals, modem signals (maximum transmission speed that can be obtained), and DTMF transmission.
• Tandeming properties (i.e., the number of times voice can be encoded and decoded before voice quality becomes unacceptable). This can be assessed with the same coder used repeatedly or with other well known coders (e.g., the GSM coders used in cellular phones).
• Sensitivity to errors (bit errors for cellular or DCME applications, frame erasures for VoIP).
• The flexibility of the coder to dynamically adapt its bit allocation to congestion and degradation of the transmission channel. Some coders provide only a fixed bitrate, while others can switch between bitrates dynamically (embedded coders). Hierarchical coders like G.722 can generate several simultaneous streams of encoded speech data: a core stream that needs to be transmitted as reliably as possible through the transmission channel (either at a high QoS level or using an efficient redundancy mechanism), and one or more ‘enhancement’ streams that can be transmitted over lower quality channels (a sketch of this layering follows the list).
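As an illustration of algorithmic delay, the figure for a frame-based coder is simply the frame duration plus the look-ahead. The short sketch below (the dictionary and function names are ours) uses the standard frame and look-ahead values of G.711, G.729, and G.723.1:

```python
# Algorithmic delay = frame duration + look-ahead, both in milliseconds.
# The frame/look-ahead figures are the standard values for each coder.

CODERS = {
    "G.711":   {"frame_ms": 0.125, "lookahead_ms": 0.0},  # sample-based PCM
    "G.729":   {"frame_ms": 10.0,  "lookahead_ms": 5.0},
    "G.723.1": {"frame_ms": 30.0,  "lookahead_ms": 7.5},
}

def algorithmic_delay_ms(frame_ms: float, lookahead_ms: float) -> float:
    """Delay contributed by the coding algorithm alone (no processing delay)."""
    return frame_ms + lookahead_ms

for name, params in CODERS.items():
    print(f"{name:8s} {algorithmic_delay_ms(**params):6.3f} ms")
# G.711    0.125 ms
# G.729   15.000 ms
# G.723.1 37.500 ms
```

G.723.1 pays for its low bitrate with a 37.5-ms algorithmic delay, two and a half times that of G.729.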
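To make the core/enhancement idea concrete, here is a minimal sketch of a sender policy that drops enhancement layers first as congestion grows; the `LayeredFrame` container and `packets_to_send` function are hypothetical illustrations, not part of any standard:

```python
from dataclasses import dataclass

@dataclass
class LayeredFrame:
    """Hypothetical container for one hierarchically encoded speech frame."""
    core: bytes                # must arrive for intelligible speech
    enhancements: list[bytes]  # each layer refines quality if it arrives

def packets_to_send(frame: LayeredFrame, congestion: float) -> list[bytes]:
    """Drop enhancement layers first as congestion (0.0-1.0) increases.

    The core layer is always kept; with n enhancement layers, roughly
    (1 - congestion) * n of them are transmitted.
    """
    keep = round(len(frame.enhancements) * (1.0 - congestion))
    return [frame.core] + frame.enhancements[:keep]

# Example: two enhancement layers, 50% congestion -> core + one enhancement.
frame = LayeredFrame(core=b"core", enhancements=[b"enh1", b"enh2"])
print(len(packets_to_send(frame, congestion=0.5)))  # 2
```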
The importance of these parameters depends on the final application and the target transmission network (fixed network, wireless network, serial transmission links that generate bit errors, packet transmission links that generate frame erasure errors, etc.). For most applications, the shortlist of key parameters includes the bitrate, complexity of the coder, delay, and quality.
When a standard body decides to standardize a new voice coder, the first step is to specify the quality acceptance criteria for the future coder. As an example, Table 2.14 [A19] is a summary of the terms of reference set to specify the ITU 8-kbit/s coder (the future G.729). This new coder was intended to ‘replace’ the G.726 at 32 kbit/s or the G.728 at 16 kbit/s.
2.8.1 Speech quality assessment

In order to assess the level of quality of a speech coder, objective measurements (computed from a set of measurements on the original signal and the reproduced signal) are not reliable for new coders. In fact, most objective, automated measurement tools can only be used on well-known coders and well-known networks, and simply perform some form of interpolation using quality scores in known degradation conditions obtained using a subjective method. In the early days of VoIP, people tended to apply the known, objective measurement tools, calibrated on fixed TDM networks, without realizing that transmission link properties were completely different: frame erasure as opposed to random bit errors, correlated packet loss, degradations due to the dynamic adaptation of jitter buffers, etc. Needless to say, many of these ‘objective’ tests were in fact designed as a marketing tool for this or that voice coder.
Subjective measurements are therefore indispensable. What can assess the quality of a voice coder better than a human being? Unfortunately, subjective measurements of speech quality require a substantial effort and are time-consuming. In order to obtain reliable and reproducible results, a number of precise guidelines must be followed:
• Ensure that the total number of listeners is sufficient for statistically reliable results.
• Ensure that the auditory perception of listeners is normal (medical tests may be necessary).
• Instruct the listeners in the test methodology.
• Ensure that the speech material is diversified: gender, pronunciation, and age of the talkers (including children).

Table 2.14 Terms of reference for the 8-kbit/s ITU-T speech coder

| Item | Parameters | Requirements | Objectives |
| Quality for speech | | No worse than that of ITU-T G.726 at 32 kbit/s | |
| Performance in the presence of transmission errors (bit errors) | Random bit error, BER < 0.1 | No worse than that of ITU-T G.726 at 32 kbit/s under similar conditions | Equivalent to ITU-T G.728 |
| Performance in the presence of frame erasure (random and burst) | Indication of frame erasure | Less than 0.5 MOS degradation with 3% missing frames | As small as possible |
| Input-level dependence | -36 dB, -16 dB | No worse than that of ITU-T G.726 at 32 kbit/s | As small as possible |
| Algorithmic delay | | < 16 ms | < 5 ms |
| Total codec delay | | < 32 ms | < 10 ms |
| Cascading | 2 asynchronous codings | < 4 asynchronous codings of ITU-T G.726 at 32 kbit/s | 3 asynchronous codings < 4 asynchronous codings of ITU-T G.726 at 32 kbit/s |
| Tandeming with other ITU-T standards | | < 4 asynchronous codings of ITU-T G.726 at 32 kbit/s | 3 asynchronous codings < 4 asynchronous codings of ITU-T G.726 at 32 kbit/s |
| Sensitivity to background noise | Car noise, babble noise, multiple speakers | No worse than that of ITU-T G.726 at 32 kbit/s | |

• Ensure that the test is performed in several languages by a number of experienced organizations (problems may occur with languages other than English, Japanese, German, Italian, Spanish, or French even on a well-standardized speech coder).
• Ensure that all the environmental conditions of use of the candidate coder are tested (level dependence, sensitivity to ambient noise and type of noise, error conditions, etc.).
• Choose appropriate listening conditions: equipment (headphones, telephone handsets, loudspeakers) and loudness of the samples.
These tests are fully specified in ITU-T recommendations ITU-T P.800 and P.830 [A1], [A4]. Obviously, these tests are very time-consuming, expensive (dedicated rooms and studios, high-quality audio equipment), and require well-trained and experienced organizations.
The following subsections provide an overview of these methods. We will focus on listening opinion tests, although other tests, such as conversation opinion tests, also exist.
2.8.2 ACR subjective test, mean opinion score (MOS)

For low-bitrate telephone speech coders (between 4 kbit/s and 32 kbit/s), the absolute category rating (ACR) is the most commonly used subjective measurement method. It is the method that produces the well-known MOS figure.
In ACR subjective tests, listeners are asked to rate the ‘absolute’ quality of speech samples, without knowing what the reference audio sample is. Listening quality is generally assessed using the scale in Table 2.15.
An MOS is an absolute judgment without references, but in order to ensure coherence and calibration between successive tests, some reference is needed. For this purpose, a reference audio sample is inserted among the samples given to listeners (without their knowledge). Very often, the modulated noise reference unit (MNRU) is used: this device simulates voice degradation and a noise level equivalent to that produced by the A- or µ-law PCM coding scheme. It is still common to read press articles or conference presentations that present ‘the’ MOS of a new coder without also presenting the MOS obtained in the same test by a reference coder. Such values should be considered with skepticism: some vendors choose to give an MOS of ‘5’ to G.711, shifting all MOSs up by almost one full MOS point, while others do not even include such a reference coder in their test.
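For readers who want to experiment, here is a simplified sketch of the narrowband MNRU principle: speech-correlated noise whose level relative to the speech is set by a ratio Q in dB. This is only an approximation of the ITU-T P.810 reference implementation, and the function name is ours:

```python
import numpy as np

def mnru(speech: np.ndarray, q_db: float, rng=None) -> np.ndarray:
    """Simplified narrowband MNRU: add speech-modulated Gaussian noise.

    Each output sample is x(n) * (1 + N(n) * 10**(-Q/20)), so the noise
    amplitude follows the instantaneous speech amplitude; Q (dB) sets the
    ratio of speech to modulated noise. Lower Q means stronger degradation.
    """
    rng = rng or np.random.default_rng()
    noise = rng.standard_normal(len(speech))
    return speech * (1.0 + noise * 10.0 ** (-q_db / 20.0))

# Example: degrade one second of a 1-kHz tone (8-kHz sampling) to Q = 20 dB.
t = np.arange(8000) / 8000.0
clean = 0.5 * np.sin(2 * np.pi * 1000 * t)
degraded = mnru(clean, q_db=20.0)
```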

Table 2.15 Listening quality scale for absolute category rating

| Quality | Score |
| Excellent | 5 |
| Good | 4 |
| Fair | 3 |
| Poor | 2 |
| Bad | 1 |

The MOS figure is calculated statistically from the marks given to each audio sample by the listeners. The relevance of the MOS and the confidence interval of the results must be determined by statistical analysis, which requires many experiments. Generally, an ACR subjective test requires an average of 24 listeners (3 groups of 8). The typical test sample consists of a double sentence: 0.5 s of silence, 2 s for sentence #1, 0.5 s of silence, 2 s for sentence #2.
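A minimal sketch of the statistical step: computing the MOS for one test condition together with a normal-approximation 95% confidence interval over the panel’s scores. The 24 listener scores below are invented purely for illustration:

```python
import math
import statistics

def mos_with_ci(scores: list, z: float = 1.96) -> tuple:
    """Mean opinion score and half-width of its ~95% confidence interval."""
    mos = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mos, half_width

# Example: 24 listeners rating one condition on the 1-5 ACR scale.
scores = [4, 4, 3, 5, 4, 4, 3, 4, 5, 4, 4, 3,
          4, 4, 5, 3, 4, 4, 4, 3, 5, 4, 4, 4]
mos, ci = mos_with_ci(scores)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```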
Figure 2.57 provides an overview of typical MOS values for various categories of speech coders as a function of bitrate [A6]. More precisely, Table 2.16 gives the MOS figure and type of well-known, ITU-T standardized speech coders. For mobile standards see Table 2.17 and for DOD standards see Table 2.18.
Table 2.18 clearly shows the magnitude of the improvements made in speech coders over ten years: the speech quality that can be obtained at 2.4 kbit/s goes from synthetic quality to an MOS of 3.2 (fair)!
Figure 2.57 MOSs as a function of the bitrate and coder technology.

Table 2.16 MOSs of some ITU coders

| Standard | G.711 | G.726 (formerly G.721) | G.728 | G.729 | G.723.1 |
| Date of approval | 1972 | 1990 (1984) | 1992 | 1995 | 1995 |
| Bitrate (kbit/s) | 64 | 16/24/32/40 | 16 | 8 | 6.3/5.3 |
| Type of coder | Waveform: PCM | Waveform: ADPCM | ABS: LD-CELP | ABS: CS-ACELP | ABS: MP-MLQ/ACELP |
| Speech quality (MOS) | 4.2 | 2/3.2/4/4.2 | 4.0 | 4.0 | 3.9/3.7 |

Table 2.17 MOSs of coders used in mobile telephony


Table 2.18 MOS scores of military coders

| Standard | American DOD FS1015 | American DOD FS1016 | American DOD MELP |
| Date of approval | 1984 | 1990 | 1995 |
| Bitrate (kbit/s) | 2.4 | 4.8 | 2.4 |
| Type of coder | Vocoder: LPC-10 | ABS: CELP | ABS: MELP |
| Speech quality (MOS) | Synthetic quality | 3 | 3.2 |

2.8.3 Other methods of assessing speech quality

ACR is not the only method available for speech quality assessments. The degradation category rating (DCR) and the comparison category rating (CCR) are also used, mostly for high-quality coders.
The DCR method is preferred when good-quality speech samples are to be compared. It produces a degradation mean opinion score (DMOS). The degradation scale is presented in Table 2.19.
DCR methodology is similar to ACR, except that the reference sample is known to the listener and presented first: pairs of samples (A-B) or repeated pairs (A-B, A-B) are presented with A being the quality reference.
CCR is similar to DCR, but the order of the reference sample and the evaluated coder sample is chosen at random: this method is interesting mostly for speech enhancement systems. The result is a comparison mean opinion score (CMOS).
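The difference in presentation order can be summed up in a tiny sketch; the helper names (`dcr_trial`, `ccr_trial`) are hypothetical, used only to contrast the two protocols:

```python
import random

def dcr_trial(reference: bytes, processed: bytes) -> list:
    """DCR: the known quality reference A is always presented first."""
    return [reference, processed]

def ccr_trial(reference: bytes, processed: bytes, rng=None) -> list:
    """CCR: the two samples are presented in random order."""
    rng = rng or random.Random()
    pair = [reference, processed]
    rng.shuffle(pair)
    return pair

# Example trial playlists for one listener.
print(dcr_trial(b"ref", b"coded"))  # always [b'ref', b'coded']
print(ccr_trial(b"ref", b"coded"))  # order varies from trial to trial
```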
For all interactive communication systems, especially VoIP, conversational tests are also very instructive because they try to reproduce the real service conditions experienced by end users. The degradations introduced by echo and delay, which are absent from listening-only MOS tests, can also be taken into account. The test panel is asked to communicate using the system under test (e.g., DCME or VoIP), following a scenario or playing a game, and finally gives an opinion on the communication quality and on other parameters, such as clarity, level of noise, perception of echo, delay, interactivity, etc. Once again, each participant gives a score from 1 to 5 (as described in ITU-T recommendation P.800), and statistical methods are used to compute the test result (MOSc, ‘c’ for communication) and its confidence interval. Interactive tests are very difficult to control, and consistency and repeatability are hard to obtain.
An example of the sample conditions used in the international experiments conducted by the ITU-T when selecting an 8-kbit/s candidate is given in Table 2.20.

Table 2.19 DMOS table

| Degradation | Score |
| Degradation is inaudible | 5 |
| Degradation is audible but not annoying | 4 |
| Degradation is slightly annoying | 3 |
| Degradation is annoying | 2 |
| Degradation is very annoying | 1 |

Table 2.20 Typical ITU experiments for coder selection

| Experiment | Description |
| Experiment 1 | Clean speech quality and random bit error performance |
| Experiment 2 | Tandem connection and input-level dependence |
| Experiment 3 | Frame erasure: random and burst |
| Experiment 4 | Car noise, babble noise, multiple speakers, and music |
| Experiment 5 | Signaling tones: DTMF, tones, etc. |
| Experiment 6 | Speaker dependence: male, female, child |

2.8.4 Usage of MOS

As MOSs represent a mean value, extreme care must be taken when selecting or promoting a speech coder for a specific application. It must be checked that all candidate coders are evaluated under the same conditions (clean speech, level dependence, several types and levels of background noise, sensitivity to bit errors, frame erasure, etc.) and that the test conditions actually represent the real conditions of the communication channels used by the application. International bodies, such as the ITU-T, TIA, ETSI, and JDC, are well aware of the situation and evaluate each coder according to a rigorous and thorough methodology. Too often, manufacturers publish and promote MOS results that have no scientific value. A few examples of common tricks are:
• Publishing good MOS results with a high network loss rate (10%!), but with a carefully engineered loss pattern that does not represent the real situation (e.g., exactly one packet out of 33 is lost, as opposed to the correlated packet loss of a real network; see the sketch after this list).
• Taking a higher MOS value for the reference coder, but omitting this detail in the final test documentation.
• Using test samples free from background noise.
• Using listening equipment of low quality that smooths the perception of all coders and therefore boosts the results of the tested coder after normalization.
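To illustrate the first trick above, the sketch below contrasts an ‘engineered’ periodic loss pattern with a two-state Gilbert model producing bursty, correlated losses at the same ~3% average rate; the function names and the 0.5 burstiness parameter are our own choices for illustration:

```python
import random

def periodic_loss(n: int, period: int = 33) -> list:
    """'Engineered' pattern: exactly one packet lost every `period` packets."""
    return [(i % period == period - 1) for i in range(n)]

def gilbert_loss(n: int, p_loss: float = 0.03, burstiness: float = 0.5,
                 rng=None) -> list:
    """Two-state Gilbert model: same average rate, but losses come in bursts.

    `burstiness` is the probability of staying in the 'lost' state; the
    good->bad transition probability is derived so that the stationary
    loss rate equals `p_loss`.
    """
    rng = rng or random.Random(0)
    p_gb = p_loss * (1 - burstiness) / (1 - p_loss)  # good -> bad
    lost, out = False, []
    for _ in range(n):
        lost = rng.random() < (burstiness if lost else p_gb)
        out.append(lost)
    return out

# Both patterns lose about 3% of packets, but burst lengths differ sharply,
# and decoders conceal isolated losses far better than bursts of losses.
n = 10000
print(sum(periodic_loss(n)) / n, sum(gilbert_loss(n)) / n)
```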
