VOICE QUALITY MEASUREMENTS (VoIP)

20.1
To compare voice quality and to identify the parameters that contribute to it, it is necessary to know how voice quality is measured and what the quality goals are. The basic test setup for VoIP voice measurements is given in topic 13. This topic discusses voice quality expressed as mean opinion scores (MOS), the classifications of measurement methods, the parameters that influence voice quality, and possible improvements. Voice quality measurements for MOS are classified as subjective and objective.
Table 20.1. PSTN and VoIP Quality Comparisons

Attributes	PSTN	VoIP
Distortions on analog line	Distortions due to several 1000-feet lines from DLC or CO location	No analog transmission distortions with VoIP calls.
Echo cancellation on national calls	Achieved through loss planning and low delays	Carrier grade echo cancellers are used.
Automatic gain control	Not incorporated	Possible to incorporate for better perception of speech levels or listening quality experience.
Voice quality monitoring	Monitoring such as GR-909 is incorporated into the PSTN	RTCP-XR and GR-909 are incorporated in many VoIP deployments.
Bandwidth or bit rate	64 kbps fixed on digital TDM. DCME channels use 16, 24, 32, and 40 kbps, which degrades fax quality	Variable bandwidth, usually requires more on physical interfaces than the PSTN. Fax services can get more bandwidth or redundancy in transmission.
Fax calls	Performance limited by the end transmission line characteristics	Uses short end lines. Hence, fax delivery can be better using VoIP. However, there could be interoperability issues for sending fax.
Voice and data	Mainly for voice calls, some services may reuse voice channels for data	Internet service and VoIP can scale along with data and media service requirements.
Voice call features	Limited features and expensive for several service features	Several features are offered as free.
Voice interfaces	Limited interfaces	Multiple interfaces and services.
Long distance	Long distance is costly	Usually free or much lower rates.
Transcoding	Multiple levels of transcoding for inter-regional calls	End-to-end direct coding can be employed based on the available support.
Wideband support	Voice calls are narrowband	Wideband end-to-end voice is possible that can exceed PSTN quality.

A functional representation of some popular voice quality measurement techniques is illustrated in Fig. 20.1. In the figure, voice is shown traveling from the sending gateway to the receiving gateway. The receiving gateway is shown with additional expanded blocks to give a big picture of the E-model, which is used for R-factor estimation, additional quality metrics, and Real-Time Transport Control Protocol-Extended Reports (RTCP-XR) operation. The E-model uses RTP, RTCP, jitter buffer, and total system signal parameters. After the R-factor and other derived parameters are calculated, RTCP-XR can send packets to the internal applications, the destination gateway, and the RTCP-XR server.

Figure 20.1. Overview on popular voice quality measurements.

In summary, the nonintrusive R-factor is an objective estimation that resides as part of the VoIP implementation, and additional software is required in the gateway for the R-factor estimation. In perceptual evaluation of speech quality (PESQ), instruments like MultiDSLA [URL (DSLAII)] send reference speech through the VoIP system under test and compare the degraded speech with the reference speech. This measurement is active, and VoIP gateways do not need to know anything about the measurement. In subjective listening, multiple listeners evaluate the voice quality. In P.563, the analysis is performed entirely on the received degraded signal, and the original reference is not required. P.563 is similar to subjective listening, but the evaluation is performed by instruments or processors. Each of these techniques arrives at a different scale of voice quality.

In a VoIP voice call between A and B, voice measurements are made as half-duplex, which means measurements are made from A to B or from B to A, one direction at a time. Because this half-duplex testing assesses listening only, these measurements are referred to as listening quality (LQ) tests. The suffix LQ is appended when presenting the results of half-duplex tests, and objective tests are additionally suffixed with “O” as LQO.
20.1.1 Subjective Measurement Technique

In subjective voice quality evaluation, the voice quality MOS is rated by a group of actual male and female listeners; it is a real listening test. Recommendations P.800 and P.830 [ITU-T-P.830 (1996), ITU-T-P.800 (1996)] are used for assessing the subjective performance of speech codecs, and the same tests are extended to VoIP voice quality. Multiple test phrases are recorded, and then the test subjects (a group of people) listen to them under different conditions and record their subjective scores. These tests are performed in special rooms where background noise and other environmental factors are kept under control. The test conditions are given in [ITU-T-P.800 (1996)]. The subjective measurement techniques are categorized as absolute category rating (ACR), degradation category rating (DCR), and comparison category rating (CCR).
In ACR, participants listen to recorded speech samples that have been processed through several test connections. A minimum of 16 test subjects (listeners) should participate in the assessment. While listening, each participant rates the call on a MOS scale of 1 to 5. The ratings are averaged to give the overall call quality.
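The ACR averaging step can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above; the function name `acr_mos`, the constant `MIN_LISTENERS`, and the sample ratings are hypothetical and are not taken from P.800.

```python
from statistics import mean

MIN_LISTENERS = 16  # minimum panel size stated in the text


def acr_mos(ratings):
    """Average individual ACR ratings (each 1..5) into a single MOS value."""
    if len(ratings) < MIN_LISTENERS:
        raise ValueError("ACR needs at least 16 listener ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1 to 5 scale")
    return mean(ratings)


# Made-up ratings from a hypothetical panel of 16 listeners.
panel = [4, 4, 5, 3, 4, 4, 3, 5, 4, 4, 3, 4, 5, 4, 4, 4]
print(acr_mos(panel))
```

A real test would also report the spread of the ratings (for example, a confidence interval), which this sketch leaves out.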
In a DCR test, two speech samples are presented. The first speech sample is a reference sample with predefined quality. A sample here refers to speech lasting several seconds. The other speech sample is a degraded version. Listeners must compare the degraded version with the reference on a degradation scale of 1 to 5. Here, 5 represents inaudible degradation and 1 represents the worst degradation. The results are summarized as degradation MOS (DMOS).
In CCR tests, users are asked to listen to two sets of samples, one corresponding to the reference and the other to the degraded version. This test is similar to DCR, except that the order in which the samples are presented to the listeners is changed in different iterations. The order of reference and degraded samples is not declared to the listener. Listeners are asked to give a comparative rating of the second sample with respect to the first one on a scale of -3 to 3 as per P.800 Annex D [ITU-T-P.800 (1996)]. In presenting the results, “3” represents much better quality and “-3” represents much worse quality on a relative scale. The quality score is mapped to MOS. The MOS rating allowed is 1 to 5, but a user rating above 4.5 is limited to 4.5.
Subjective tests involve elaborate procedures and are costly. They can therefore be repeated only a few times when evaluating a new algorithm or speech codec. It is also difficult to achieve the consistency of instrument-based objective tests.
20.1.2 Objective Measurement Techniques

Objective methods rely on measurements and calculations, so the results are expected to be consistent across repeated measurements. Several objective methods exist, and they are classified as active and passive methods.
• Active monitoring techniques of PESQ [ITU-T-P.862 (2001)]
• Passive monitoring techniques of P.563 and the E-model [ITU-T-P.563 (2004), ITU-T-G.107 (2005)]
Active Monitoring Techniques. Active measurement is also called intrusive monitoring or offline monitoring because it involves external signals.
Lower cost objective methods were developed in an effort to supplement subjective listening quality testing. KPN developed the perceptual speech quality measure (PSQM), standardized as P.861 (now obsolete), for the evaluation of codec performance. British Telecom developed the perceptual analysis measurement system (PAMS) for network measurements. The P.862 PESQ resulted from an ITU competition. The performance of PAMS and a new version of PSQM, PSQM99, were similar, so the contributors were invited to combine the algorithms. This resulted in PESQ, which is slightly better than its constituents.
These methods measure the distortion introduced by a transmission system and codec by comparing an original reference file sent into the system on one telephone interface with the impaired signal received on another telephone interface. PSQM was developed for laboratory testing of speech codecs. PAMS and PESQ are designed for network testing. Using instruments to measure voice quality is much simpler than subjective or passive measurements. Instrument suppliers also provide extra derived parameters that help to identify the sources of degradation. Refer to the instruments given in topic 13 for more details on various features.
At the time of writing, PESQ was widely supported in the instruments. PESQ was approved by the ITU in March 2001 as the P.862 recommendation, replacing P.861 PSQM. PESQ combines the best merits of PAMS and PSQM. It is accurate in predicting subjective test scores, it is robust under severe network conditions such as variable delay and filtering at analog interfaces, and it supports both wideband and narrowband. PESQ produces a score that lies on a scale from -0.5 to 4.5. A mapping function from a P.862 PESQ score to an average subjective P.800 listening quality MOS score was provided, giving PESQ-LQO [ITU-T-P.862.1 (2003)] for narrowband voice. LQO denotes listening quality objective. PESQ-LQ lies from 1 to 4.5. A MOS of 4.5 is the maximum quality achieved for a clear undistorted condition. An overview of the PESQ algorithm is given here. For more details, refer to the ITU P.862 family of recommendations, software, and some commercial instrument brochures [URL (DSLAII)].
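As an illustration of the narrowband mapping, the following Python sketch applies the logistic function commonly quoted for P.862.1. The coefficients are given here from memory and should be checked against the recommendation before use; the function name `pesq_to_mos_lqo` is hypothetical.

```python
import math


def pesq_to_mos_lqo(pesq_score):
    """Map a raw P.862 score (-0.5..4.5) to a narrowband MOS-LQO value.

    The coefficients are assumed to be those of P.862.1; verify them
    against the recommendation before relying on the result.
    """
    return 0.999 + (4.999 - 0.999) / (
        1.0 + math.exp(-1.4945 * pesq_score + 4.6607)
    )


for raw in (-0.5, 2.0, 3.5, 4.5):
    print(raw, round(pesq_to_mos_lqo(raw), 2))
```

With these coefficients, the ends of the raw scale map to roughly 1 and 4.5, which agrees with the range stated above.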
20.1.3 PESQ Measurement

Human auditory perception is the core concept behind PESQ and its predecessors PAMS and PSQM. A perceptual model is used to distinguish correctly between audible and inaudible distortions, and this has proven to be the best way of accurately predicting the audibility and annoyance of complex distortions. In addition to the quantity of distortion, the distribution of audible distortion could make quality predictions much more accurate.
PESQ measures one-way voice quality; that is, the measurement is a half-duplex operation. It assesses the quality of a distorted speech signal that has been coded and transmitted over the network by comparing it with the original undistorted signal. The original and distorted speech signals are mapped onto psychophysical representations that match the way humans experience speech.
The quality of the distorted speech is judged based on differences in the psychophysical representations. The PESQ operation makes use of two major classes of algorithmic operations, namely conversion of signals into the psycho-acoustic domain and cognitive modeling. A functional representation of the PESQ algorithm is given in Fig. 20.2. Instrument manufacturers include several extra operations in their PESQ measurements to extract signal analysis parameters and impairments in addition to the PESQ score.

Figure 20.2. PESQ algorithm functional representations.
The processing carried out by the PESQ algorithm includes the stages listed below. Summary steps are given here; several details on the PESQ are given in [ITU-T-P.862 (2001), Rix et al. (2002), Beerends et al. (2002)].
In the first step of processing, both the reference and the degraded signal are scaled to the same constant power level. This scaling is necessary because the reference signal does not have to be at a defined level and the gain of the system under test is unknown before testing. PESQ assumes that the subjective listening level is a constant 79 dB SPL at the ear reference point [ITU-T-P.830 (1996)]. For power normalization, electrical signal levels are normalized to -26 dBov (i.e., -20 dBm as given in the references [URL (DSLA-usrgd), Malfait et al. (2006)]). This signal-level normalization is applied to both the reference and the degraded signal to bring them to the same level.
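The level alignment can be pictured with a small Python sketch. It is a simplification: it scales on the plain RMS of the whole signal, whereas the recommendation works on band-limited speech power, and the helper names (`rms`, `level_dbov`, `normalize_level`) and the full-scale value of 1.0 are assumptions made for illustration.

```python
import math

TARGET_DBOV = -26.0  # normalization level named in the text


def rms(samples):
    """Root-mean-square value of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def level_dbov(samples, full_scale=1.0):
    """Signal level in dB relative to the overload point (assumed 1.0)."""
    value = rms(samples)
    if value == 0.0:
        raise ValueError("cannot measure the level of a silent signal")
    return 20.0 * math.log10(value / full_scale)


def normalize_level(samples, target_dbov=TARGET_DBOV, full_scale=1.0):
    """Scale the samples so that their RMS level equals target_dbov."""
    gain_db = target_dbov - level_dbov(samples, full_scale)
    gain = 10.0 ** (gain_db / 20.0)
    return [s * gain for s in samples]
```

Applying `normalize_level` to both the reference and the degraded signal brings them to the same level regardless of the unknown gain of the system under test.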
Because subjective listening tests may use telephone handsets, perceptual models such as PESQ should take the characteristics of the handsets into account. In PESQ, the receive path of the handset is modeled using an intermediate reference system (IRS) band-pass filter [ITU-T-P.830 (1996)] in the frequency domain. This process takes into account the effects of the electrical and acoustic components of the handset. Both the reference and the degraded signal are IRS filtered.
The system under test may include variable delay. To compare the reference and degraded signals, both signals are time aligned with each other. PESQ aligns overlapping sections of the speech frames. In the first stage, the delay estimation is carried out over the length of files by computing the correlation between the files. The delay obtained in this stage is called crude delay. In the next stage, PESQ applies voice activity detection to the signals to identify required speech segments usually referred to as utterances. The delay estimate between utterances is the fine delay. This process detects delay that is variable over the length of an utterance, as this can be significant in packet-based networks.
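The crude delay stage can be illustrated with a brute-force cross-correlation search in Python. This is a sketch under simplifying assumptions: it correlates the raw samples directly and searches a fixed lag range, whereas an actual PESQ implementation works on signal envelopes and then refines the estimate per utterance. The function name `crude_delay` and the parameter `max_lag` are hypothetical.

```python
def crude_delay(reference, degraded, max_lag):
    """Return the lag (in samples) that maximizes the cross-correlation.

    A positive result means the degraded signal lags the reference.
    """
    best_lag = 0
    best_corr = float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        corr = 0.0
        for i, ref_sample in enumerate(reference):
            j = i + lag
            if 0 <= j < len(degraded):
                corr += ref_sample * degraded[j]
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag


# A short made-up reference and a copy delayed by five samples.
reference = [0.0, 1.0, 3.0, -2.0, 4.0, -1.0, 0.5, 0.0]
degraded = [0.0] * 5 + reference
print(crude_delay(reference, degraded, max_lag=10))
```

The search cost grows with the product of signal length and lag range, so practical implementations compute the correlation with an FFT instead.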
The time-aligned reference and degraded signals are transformed into the frequency domain by using a short-term fast Fourier transform (FFT) with a Hann (Hanning) window over 32-ms frames with 50% overlap. The powers of the original and degraded signals are computed and stored separately. In the next stage of operations, the frequency bands are transformed to the Bark scale by binning FFT bands. This process warps the frequency scale in Hz to the pitch scale, and the resulting signals are called pitch power densities. In this process, wider bands are used at higher frequencies.
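The framing that precedes the FFT can be sketched as follows in Python. The sketch assumes 8-kHz narrowband sampling, so a 32-ms frame holds 256 samples and a 50% overlap gives a hop of 128 samples; the function names are hypothetical, and the FFT and Bark binning themselves are not shown.

```python
import math


def hann_window(length):
    """Periodic Hann window of the given length."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length)
            for n in range(length)]


def windowed_frames(samples, sample_rate=8000, frame_ms=32, overlap=0.5):
    """Split samples into overlapping frames and apply a Hann window."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(frame_len * (1.0 - overlap))
    window = hann_window(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append([samples[start + n] * window[n]
                       for n in range(frame_len)])
    return frames


frames = windowed_frames([1.0] * 1024)
print(len(frames), len(frames[0]))
```

Each windowed frame would then be passed to an FFT, and the squared magnitudes grouped into Bark bands to obtain the pitch power densities.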
The filtering effects in the system under test are equalized by computing a partial compensation factor for each Bark bin and multiplying each frame of the reference signal by this factor. This process equalizes the reference to the degraded signal. The compensation factor is computed as the ratio of the degraded signal spectrum to the reference signal spectrum. This factor takes into account the filtering at analog components of the network such as telephone handsets. In the second stage of equalization, the frame-by-frame amplitude gain of the system is estimated and used to equalize the degraded signal to the reference signal. In both cases, the equalization is partial, and large amounts of filtering or gain variation are not cancelled; therefore, such effects are still measured as errors. The frequency and gain-equalized pitch power densities are transformed to the loudness scale using Zwicker’s law [ITU-T-P.862 (2001)]. The resulting time-frequency components are called loudness densities.
The signed difference between the loudness densities of the reference and degraded signals is known as the raw disturbance density, which shows any audible differences introduced by the system under test. A masking operation applies a mask factor to the raw disturbance densities so that small distortions that are inaudible in the presence of loud signals are masked. The disturbance density obtained by this process is called the absolute or symmetric disturbance density. The symmetric disturbances are integrated over the length of the frame (intraframe). Consecutive frames with a frame disturbance above a threshold are categorized as bad frames. Bad frames may occur because of incorrect time delay estimation or packet drops. On a localized window around the bad frames, a new delay estimate is made and used to recompute the disturbance densities. The minimum of the previous and current disturbances is taken as the final disturbance in that bad frame window.
To model the distortion introduced by the codec used in the network, an asymmetric disturbance density is calculated by multiplying the symmetric disturbance density by an asymmetry factor. The asymmetry factor is the ratio of the distorted to the original pitch power densities, raised to the power of 1.2. This disturbance density is called an additive or asymmetric disturbance.
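The asymmetry weighting described above reduces to a one-line calculation, sketched here in Python. The small constant `eps` is an assumption added only to avoid division by zero, and the function name is hypothetical; P.862 applies further offsets and limits to the factor that are not reproduced here.

```python
def asymmetric_disturbance(symmetric, ppd_degraded, ppd_reference, eps=1e-12):
    """Weight a symmetric disturbance value by the asymmetry factor.

    The factor is the ratio of degraded to reference pitch power
    density raised to the power of 1.2, as stated in the text.
    """
    factor = ((ppd_degraded + eps) / (ppd_reference + eps)) ** 1.2
    return symmetric * factor


# Energy added by the system (ratio > 1) is weighted more heavily
# than energy removed by it (ratio < 1).
print(asymmetric_disturbance(1.0, 2.0, 1.0))
print(asymmetric_disturbance(1.0, 0.5, 1.0))
```

The effect is that components added by the codec count for more than components it removes, which is the reason for calling this disturbance additive.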
Finally, the error parameters are converted to a quality score, which is a linear combination of the average symmetric disturbance value and the average asymmetric disturbance value. In Fig. 20.2, the stages from level alignment to intensity warping on the loudness scale are known as the conversion to the psycho-acoustic domain, and the algorithmic stages from perceptual subtraction to PESQ score computation are known as cognitive modeling.
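The final linear combination can be written as a short Python sketch. The coefficients 4.5, 0.1, and 0.0309 are the values commonly quoted for P.862 and are treated here as assumptions to be verified against the recommendation; the function name `pesq_score` is hypothetical.

```python
def pesq_score(avg_symmetric, avg_asymmetric):
    """Combine the two average disturbance values into a raw PESQ score.

    The coefficients are assumed from P.862 and should be verified.
    """
    return 4.5 - 0.1 * avg_symmetric - 0.0309 * avg_asymmetric


# No disturbance at all gives the top of the scale.
print(pesq_score(0.0, 0.0))
# Made-up disturbance values for a degraded call.
print(round(pesq_score(10.0, 20.0), 3))
```

Larger disturbance values lower the score, with the symmetric term weighted more heavily than the asymmetric term.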
PESQ gives a score known as the PESQ score in accordance with P.862. The PESQ score is in the range of -0.5 to 4.5. The correlation between PESQ and subjective MOS is 0.94, based on experiments conducted on databases by [Malfait et al. (2006)]. Compared with subjective (actual listener) scores, PESQ gives higher scores for poor-quality speech and pessimistic scores for good-quality speech. PESQ-LQ provides better correlation with subjective scores than PESQ on a listening quality scale. PESQ-LQ scores are in the range of 1 to 4.5. Recommendation P.862.1 provides a mapping between the narrowband PESQ score and the listening quality objective mean opinion score (MOS-LQO). Recommendation P.862.2 provides the corresponding mapping between the wideband PESQ score and MOS-LQO. More information on these scores can be found in the ITU-T P.862 series of recommendations and in reference [URL (DSLAII)].
PESQ is a half-duplex operation that does not accurately capture end-to-end delay, echo, loudness loss, sidetone, and listening levels. From voice quality measurements of a VoIP gateway with analog interfaces, the following PESQ-LQO observations were made using DSLA [URL (DSLAII)]. Under the no packet loss condition, the PESQ-LQO score is 4.32 for the G.711 codec, 3.85 for G.729A, and 3.75 for G.723.1. Another interpretation of these results for packet drop situations and a comparison with the E-model are given as part of the R-factor calculations and presented in Table 20.4. In the process of PESQ calculations, several other parameters can be computed. Instrument suppliers provide these parameters as additional features to PESQ measurements [URL (DSLAII)].
20.1.4 Passive Monitoring Technique

In passive monitoring techniques, the reference signal is not present. Two popular methods for passive speech quality monitoring exist. The ITU has standardized a signal-based nonintrusive monitoring method, P.563, based on the result of collaboration among three companies, Psytechnics Ltd., Swissqual, and Opticom, which combined the best parameters of three different models. P.563 is a single-ended objective measurement that makes use of a speech production mechanism, whereas the other speech models make use of listening perception. The algorithm operates entirely on the received degraded speech and does not need reference speech. The measurements through P.563 derive several parameters from the received speech, classified as noise, artificial speech, and actual speech. An overview of the P.563 single-ended speech-quality assessment operation is given here.
In the absence of a reference signal, the models do not have knowledge of the original signal and assumptions have to be made about the received signal. The P.563 model combines three basic principles for evaluating distortions. The first principle focuses on the human voice production system, modeling the vocal tract as a series of tubes, with abnormal variations of the tubes’ sections considered as degradation. The second principle is to reconstruct a clean reference signal from the degraded signal in order to apply a full-reference perceptual model thereafter and to assess distortions unmasked during the reconstruction. The third principle is to identify and to estimate specific distortions encountered in voice channels, such as temporal clipping, robotization, and noise. Listening speech quality is derived from the calculated parameters from the three principles, applying a distortion-dependent weighting.
At the time of writing, the P.563-based technique was not widely accepted for measurements. P.862 PESQ-based measurements and E-model-based estimations are more widely accepted. The main advantage of the P.563 technique is its ability to monitor at the degraded end without requiring a reference. Thus, it can better monitor long-distance calls outside the laboratory and in deployments, where it is much simpler to conduct than many other measurements. The P.563-based method can also be embedded as part of the receiving gateway, similar to the E-model and RTCP-XR. P.563 operations can be used on samples that are delivered on the pulse code modulation (PCM) voice interfaces.
More information on the P.563 technique can be found in P.563 [ITU-T-P.563 (2004)] and [Malfait et al. (2006)]. The MOS scores produced by P.563 and other techniques vary widely from test to test, so it is necessary to average the results of multiple tests to achieve a stable quality metric. The correlation of P.563 with subjective MOS is 0.85 to 0.9 based on the experiments conducted on a database by [Malfait et al. (2006)], and that of PESQ is reported as 0.94.
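Correlation figures of this kind are typically Pearson correlation coefficients between objective scores and subjective MOS over a set of test conditions. The Python sketch below shows the calculation on made-up numbers; the data values and the function name `pearson` are hypothetical and are not taken from the cited experiments.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up per-condition scores, for illustration only.
subjective_mos = [4.2, 3.8, 3.1, 2.5, 1.9]
objective_mos = [4.0, 3.9, 3.3, 2.2, 2.1]
print(round(pearson(subjective_mos, objective_mos), 3))
```

Averaging several objective results per condition before computing the correlation reduces the scatter mentioned above.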