Adaptive jitter buffers (AJBs) take several inputs and aim for the best possible packet delivery while keeping the buffering delay as low as possible. Several AJB algorithms exist [Ramjee et al. (1994), Pinto and Christensen (1999), Tseng et al. (2004), Moon et al. (1998)]. This section presents some popular AJB concepts. Each algorithm has many fine-grained details and additional proprietary operational steps. The algorithm for playout delay adjustment should be chosen based on the deployment conditions, the quality goals, and the parameters available from RTCP, RTCP-XR, and QoS. In a wireless environment, the network and end-terminal conditions change frequently, so a faster adaptive algorithm that works on a per-packet interval has to be chosen. Playout adjustment algorithms fall into two types, based on when packets are adjusted.
1. Talk-spurt based—adjusts adaptively during silence periods.
2. Non-talk-spurt based—adjusts on a per-packet basis or at regular time intervals.

Talk-Spurt-Based Adjustments

A talk-spurt is a significant speech zone. It is defined as a continuous section of speech at least 300 ms in duration that contains no silent period longer than 200 ms [ITU-T-P.862 (2001)]. In voice quality measurement such as P.862, a talk-spurt is referred to as an utterance. As shown in Fig. 10.4(a) with tile diagram markings, talk-spurts or utterances (solid boxes) are separated by silence regions. In VAD/comfort noise generator (CNG) enabled mode, the compression codecs mark the silences as VAD silence frames or nonspeech frames. Voice quality does not degrade when the full content of each talk-spurt is retained and only the silence zones are adjusted. Adjusting a silence zone means increasing or decreasing the duration of the silence between talk-spurts. Several early articles [Ramjee et al. (1994), Pinto and Christensen (1999)] are based on a talk-spurt scheme for fixed and variable packet drops. In relation to Fig. 10.4(a), jitter buffer operation starts with preset initial values. The silence zones are then adjusted based on parameters derived from the previous talk-spurt, such as packet drop and optimum playout delay. After the silence zone is adjusted, the optimum playout delay is applied to the next talk-spurt. Within an identified talk-spurt, the jitter buffer does not allow any adjustments.
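The 300 ms and 200 ms thresholds in the definition above can be illustrated with a minimal sketch that groups per-frame VAD decisions into talk-spurts. The function name, the 10 ms frame size, and the choice to count short internal silences toward the spurt duration are assumptions made for illustration; they are not taken from P.862 or from any particular AJB implementation.

```python
def find_talk_spurts(vad_flags, frame_ms=10, min_spurt_ms=300, max_gap_ms=200):
    """Group per-frame VAD flags into (start_ms, end_ms) talk-spurts.

    Hypothetical helper: a candidate spurt is closed when a silence gap
    longer than max_gap_ms is seen, and kept only if it spans at least
    min_spurt_ms (internal short silences are counted in the span).
    """
    spurts = []
    start = None        # index of the first active frame of the candidate
    last_active = None  # index of the most recent active frame

    def close_candidate():
        # Keep the candidate only if it is long enough to be a talk-spurt.
        if (last_active - start + 1) * frame_ms >= min_spurt_ms:
            spurts.append((start * frame_ms, (last_active + 1) * frame_ms))

    for i, active in enumerate(vad_flags):
        if not active:
            continue
        if start is None:
            start = i
        elif (i - last_active - 1) * frame_ms > max_gap_ms:
            # The silence since the last active frame is too long.
            close_candidate()
            start = i
        last_active = i

    if start is not None:
        close_candidate()
    return spurts


# 400 ms speech, 100 ms pause, 50 ms speech, 300 ms silence, 100 ms speech
flags = [1] * 40 + [0] * 10 + [1] * 5 + [0] * 30 + [1] * 10 + [0] * 5
spurts = find_talk_spurts(flags)
```

In this example the 100 ms pause is bridged, the 300 ms silence ends the first spurt, and the trailing 100 ms burst is too short to count as a talk-spurt. The silence between the returned spans is where a talk-spurt-based scheme would apply its adjustments.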
The main advantage of the talk-spurt-based scheme is improved voice quality. Speech MOS measuring instruments and the human ear are not sensitive to small adjustments of silence periods. The main disadvantage of this scheme is the difficulty of detecting talk-spurts. No detection algorithm is fixed by standards or recommendations. Some standardized techniques used in speech quality measurement operate on several talk-spurts of data to decide on the best boundaries between talk-spurts and silence zones. When background activity is high, or when the signal contains several continuous test signals, talk-spurts may not be detected. Hence, support for a non-talk-spurt scheme is also essential, either working independently or supplementing the operation of a talk-spurt-based scheme. A spike is a sudden long variation of delay that has to be treated as a long burst. A spike can happen in any region of packets, whether silence or talk-spurt. Spike characteristics are presented in the later part of this topic.

Non-Talk-Spurt-Based Adjustments

In a non-talk-spurt-based scheme, the parameters are extracted for every packet and jitter estimates are made continuously. In theory, the jitter buffer adjustments can happen for every new packet. In practice, these parameters are smoothed over several packet intervals, and adjustments are applied at regular intervals of a few hundred milliseconds. In Fig. 10.4(b), the adjustments are shown at selected intervals, and these intervals may change with the jitter characteristics. In the case of spike detection, the parameter updates can change at the transitions of spike and normal modes to adapt to the spiky conditions. After the end of the spike, the adjustments will be performed in normal mode. The parameter updates and jitter estimates will work based on the underlying algorithms, timing, logic, configurations, and set quality goals.
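A minimal sketch of this behavior follows: estimates are updated on every packet, but the buffer delay is changed only at regular intervals. The exponentially weighted smoothing, the class and parameter names, and the default values (smoothing factor 0.9, safety multiplier 4, 200 ms interval, 20 ms floor) are illustrative assumptions, not the parameters of any specific published or proprietary algorithm. Spike handling is left out.

```python
class PeriodicPlayoutEstimator:
    """Per-packet smoothed estimates with interval-based adjustments.

    Hypothetical sketch: all names and default values are assumptions.
    """

    def __init__(self, alpha=0.9, safety=4.0,
                 apply_every_ms=200.0, min_delay_ms=20.0):
        self.alpha = alpha                    # smoothing weight on history
        self.safety = safety                  # multiplier on the variation
        self.apply_every_ms = apply_every_ms  # adjustment interval
        self.min_delay_ms = min_delay_ms      # floor for the buffer delay
        self.mean_ms = None                   # smoothed transit delay
        self.variation_ms = 0.0               # smoothed absolute deviation
        self.applied_delay_ms = min_delay_ms  # delay currently in use
        self._last_apply_ms = None

    def on_packet(self, arrival_ms, transit_ms):
        """Update estimates; return True if the buffer delay was re-applied."""
        if self.mean_ms is None:
            # First packet only initializes the state.
            self.mean_ms = transit_ms
            self._last_apply_ms = arrival_ms
            return False
        a = self.alpha
        self.mean_ms = a * self.mean_ms + (1 - a) * transit_ms
        deviation = abs(transit_ms - self.mean_ms)
        self.variation_ms = a * self.variation_ms + (1 - a) * deviation
        if arrival_ms - self._last_apply_ms >= self.apply_every_ms:
            # Apply the smoothed estimate only at the interval boundary.
            self.applied_delay_ms = max(self.min_delay_ms,
                                        self.safety * self.variation_ms)
            self._last_apply_ms = arrival_ms
            return True
        return False


est = PeriodicPlayoutEstimator()
# Packets arrive every 20 ms; transit alternates between 50 ms and 90 ms.
updates = [est.on_packet(t, 50 if (t // 20) % 2 == 0 else 90)
           for t in range(0, 401, 20)]
```

Only the variation drives the buffer delay here, so a constant offset between the sender and receiver clocks does not matter. A shorter interval or a smaller smoothing factor makes the buffer react faster, at the cost of more frequent adjustments.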
The spike occurrence is indicated only in Fig. 10.4(b); spike operations are not marked in Fig. 10.4(a). Spike detection applies to both talk-spurt and non-talk-spurt modes. In general, it is simpler to use the talk-spurt-based spike detection of Fig. 10.4(a).
Figure 10.4. AJB adjustments. (a) Talk-spurt-based silence adjustments principle. (b) Non-talk-spurt-based adjustments.

Voice Flow and Delay Variations Mapping

The packet flow and impediments are described in Section 10.3. In this section, voice and packet flow are mapped directly to delay and jitter (delay variations). Figure 10.5 is a simplification of Fig. 10.3 that keeps only the blocks related to delay and jitter. The voice and packet flow is shown from phone-A to phone-B.
Voice transmitted through phone-A is compressed in the encoder, which creates the compressed payload. The payload, as frames of data, is packetized in RTP. From phone-A to the RTP input, the delays are fixed. Depending on the processor architecture and communication mechanisms, small delay variations could occur at the sub-millisecond level. This variation is negligible compared with other IP packet impediments. The networking block creates complete packets for transmission on the network. From the RTP input to the launch of the packet on the IP network, the packet goes through a fixed delay and a variable delay of a few milliseconds. The variable delay is kept below about 5 ms. More details on this jitter at the network interface are given in topic 18.
In the IP network, several routers and switches can create impediments to the packet flow. The IP network and the network interfaces cause the major impediments in most situations. At the destination, the packet goes through networking and RTP blocks that introduce a few milliseconds of fixed and variable delay. The jitter buffer that interfaces with the RTP output has to buffer the packets that arrive with variable delay. The jitter buffer converts all end-to-end variable delays into a fixed buffer delay. The decoder reads the jitter buffer output synchronously at voice decompression frame intervals. The jitter buffer cannot compensate for delay variations in the voice encoder, decoder, analog front end, or loop length. When set properly, the jitter buffer makes a steady end-to-end call behave like a fixed long delay between phone-A and phone-B.
Figure 10.5. Adaptive jitter buffer end-to-end influences. (a) Delay and delay variation contributors mapped to VoIP voice call. (b) AJB buffer delays and end-to-end delay representation.
In Fig. 10.5(b), delays are represented as tile boxes with text. The horizontal length represents the time, delay duration, or delay variation. The analog front end and the payload creation delays at both ends are shown as fixed. All variable parts of the delay are grouped and shown in a dotted box. When the variable delays are close to zero, the jitter buffer operates with its minimum delay, which is usually of the order of 20 ms. When jitter exceeds the AJB minimum threshold, the AJB keeps growing to accommodate as many packets as possible. The goal is to pass at least 99% of the available packets through the jitter buffer.
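One way to picture the 20 ms floor and the 99% goal is a percentile rule over recently observed delay variations, sketched below. The function name, the nearest-rank percentile, and the optional growth cap are assumptions made for illustration; the text does not prescribe how the buffer size is derived.

```python
def target_buffer_delay_ms(delay_variations_ms, coverage_percent=99,
                           min_delay_ms=20.0, max_delay_ms=None):
    """Pick a buffer delay that covers coverage_percent of the samples.

    Hypothetical helper: uses a nearest-rank percentile over observed
    delay variations, never goes below min_delay_ms, and honors an
    optional cap on growth.
    """
    if not delay_variations_ms:
        return min_delay_ms
    ordered = sorted(delay_variations_ms)
    n = len(ordered)
    # Nearest-rank percentile with integer arithmetic: ceil(p * n / 100).
    rank = max(1, (coverage_percent * n + 99) // 100)
    target = max(min_delay_ms, ordered[rank - 1])
    if max_delay_ms is not None:
        target = min(target, max_delay_ms)
    return target


quiet = target_buffer_delay_ms([1, 2, 3])             # stays at the floor
busy = target_buffer_delay_ms(list(range(1, 101)))    # grows with jitter
capped = target_buffer_delay_ms(list(range(1, 101)), max_delay_ms=60.0)
```

With near-zero jitter the result stays at the 20 ms minimum. With variations spread from 1 ms to 100 ms, the 99% target grows to 99 ms unless the cap limits it.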
The IP network characteristics or the end-to-end jitter may change. One adverse influence is a spike, which happens suddenly and behaves like very large jitter. The jitter buffer will try to grow, but it may limit the growth to reduce delay. In the process of delay optimization, a certain amount of packet discard (drop by the jitter buffer) may be accepted. In most situations, packet drop due to jitter buffer adjustments is kept below 1%. In some conditions, the jitter buffer input may already have several packet drops. The jitter buffer will try to deliver as many of the available packets as possible.
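The trade-off between limited buffer growth and packet discard can be sketched as a simple count of packets that arrive later than the buffer can absorb. The function name and the sample trace are hypothetical, and a real AJB tracks this per packet rather than over a stored list.

```python
def late_discard_fraction(delay_variations_ms, buffer_delay_ms):
    """Fraction of packets whose delay variation exceeds the buffer delay.

    Hypothetical helper: such packets miss their playout time and are
    treated here as discarded by the jitter buffer.
    """
    if not delay_variations_ms:
        return 0.0
    late = sum(1 for v in delay_variations_ms if v > buffer_delay_ms)
    return late / len(delay_variations_ms)


# 97 packets with 10 ms variation and a 3-packet spike of 300 ms.
trace = [10] * 97 + [300] * 3
capped_loss = late_discard_fraction(trace, buffer_delay_ms=40)
grown_loss = late_discard_fraction(trace, buffer_delay_ms=300)
```

Holding the buffer at 40 ms discards the three spike packets (3% of this trace), while growing it to 300 ms discards none but adds 260 ms of delay to every packet.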
