The Design of VoIP Systems with High Perceptual Conversational Quality - Ubiquitous Multimedia Computing


techniques. Waveform codecs, such as ITU G.711 and G.726 [11], were designed to reconstruct the waveform sample by sample as closely as possible. Parametric codecs, such as G.722.2, G.723.1, G.728, and G.729A [11], model the production of speech in order to reconstruct a waveform that perceptually resembles the original speech. Hybrid codecs, such as G.729.1 [11], iLBC [44], and iSAC [43], combine techniques from both. Under no-loss conditions, the perceptual quality of a codec is a function of its coding technique and bit rate. However, it is difficult to compare codecs under loss conditions.

Parametric and hybrid codecs are popular in VoIP because they have lower bit rates and better perceptual quality. Their design controls the frame size and the frame period to trade off robustness, quality, and algorithmic delay [44]. Frame size generally varies between 10 bytes (G.729A) and 80 bytes (G.729.1, 32-kbps wideband mode), with multimode codecs having a wide range. (For example, G.722.2 has frames between 17 and 60 bytes.) Frame period varies between 10 ms (G.729A) and 60 ms (iSAC with 30-60 ms adaptive size). The most common periods are 20 ms (such as iLBC 15.2-kbps mode, G.729.1, and G.722.2) and 30 ms (iLBC 13.3-kbps mode and G.723.1). In general, a larger frame with a shorter period achieves higher quality within the multiple modes of a codec or within a family of codecs with similar coding techniques. However, a longer frame and look-ahead window may incur more algorithmic delay.
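The relation between frame size, frame period, and bit rate can be checked with a little arithmetic. The following is a minimal sketch, not part of the original text; the function name `bit_rate_kbps` is a hypothetical helper, and the frame sizes and periods are the figures quoted above.

```python
def bit_rate_kbps(frame_bytes: int, frame_period_ms: float) -> float:
    """Payload bit rate implied by one frame of frame_bytes every frame_period_ms.

    Bits per millisecond is numerically equal to kilobits per second.
    """
    return frame_bytes * 8 / frame_period_ms


# (frame size in bytes, frame period in ms), as quoted in the text above.
examples = {
    "G.729A": (10, 10),
    "G.729.1 (32-kbps wideband mode)": (80, 20),
}

for name, (size, period) in examples.items():
    print(f"{name}: {bit_rate_kbps(size, period):.1f} kbps")
```

For G.729A this gives 8 kbps, and for the 80-byte, 20-ms G.729.1 frame it gives 32 kbps, which matches the mode named in the text. The calculation covers the codec payload only and ignores packet headers.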

Figure 2.9a summarizes some of the LC techniques employed in speech codecs. Receiver-based schemes that do not use redundant information can be classified into sample-based and model-based schemes. In early systems, silence or comfort-noise substitution [24] or repetition of the previous frame [26] was proposed in place of a lost frame. Another proposal transmitted even and odd samples in separate packets and used sample-based interpolation when a packet was lost [25]. Early model-based schemes simply repeated the codec parameters of the last successfully received frame [28]. Later, interpolation of codec parameters from the previous and the next frame was proposed [27]. Other schemes utilized information about the voiced-unvoiced properties of a speech frame to apply specialized LC for reducing the perception of degradation [29,30]. Schemes that require the cooperation of the sender and the receiver utilize partial redundant information [31-33] that is made available by the packet-stream-level LC (see Section 2.4).
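The even/odd sample scheme mentioned above can be illustrated with a small sketch. It is not the method of [25] itself: it assumes simple linear interpolation between the two neighboring received samples, and the function names `split_even_odd` and `reconstruct` are hypothetical.

```python
from typing import List, Optional, Tuple


def split_even_odd(samples: List[float]) -> Tuple[List[float], List[float]]:
    """Sender side: place even- and odd-indexed samples in separate packets."""
    return samples[0::2], samples[1::2]


def reconstruct(even: Optional[List[float]],
                odd: Optional[List[float]],
                total_len: int) -> List[float]:
    """Receiver side: rebuild the block; a lost packet is passed as None.

    Assumption: missing samples are the average of their received
    neighbors (linear interpolation). If both packets are lost, the
    block falls back to silence substitution.
    """
    out = [0.0] * total_len
    if even is None and odd is None:
        return out  # nothing received: silence
    if even is not None:
        out[0::2] = even
    if odd is not None:
        out[1::2] = odd
    if even is None or odd is None:
        first_missing = 0 if even is None else 1
        for i in range(first_missing, total_len, 2):
            # Neighbors of a missing sample belong to the received packet.
            neighbors = []
            if i - 1 >= 0:
                neighbors.append(out[i - 1])
            if i + 1 < total_len:
                neighbors.append(out[i + 1])
            out[i] = sum(neighbors) / len(neighbors) if neighbors else 0.0
    return out


samples = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
even_pkt, odd_pkt = split_even_odd(samples)
print(reconstruct(even_pkt, None, len(samples)))  # odd packet lost
```

With the odd packet lost, interior samples are recovered from their two even neighbors, and the final sample, which has only one neighbor, repeats it. A loss therefore degrades the block instead of silencing it.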

The trade-offs between frame size and frame period are different from those between packet size and packet period in the packet-stream layer. To avoid excessive losses in the Internet, it is important to choose an appropriate packet period, as long as packets are smaller than an MTU of 576 bytes and can be sent without fragmentation [45]. (In practice, an MTU of 1,500 bytes will not cause fragmentation.) For VoIP using IPv4, a packet period between 30 ms and 60 ms generally works well. When the frame period is shorter than the packet period, multiple frames have to be encapsulated in a packet before they are sent. For some codecs, the loss of a single packet can cause a misalignment of its internal states and degrade its decoded output. For
