VoIP Fundamentals (Considering VoIP Design Elements) Part 1

Voice over IP (VoIP) introduces additional challenges into a network design. Some of these challenges stem from the need to provide an acceptable level of voice quality to end users while using available bandwidth efficiently.

VoIP design also requires additional processing components not necessary in traditional data networks. Specifically, coder/decoders (that is, codecs) convert the spoken voice into a data stream. Some codecs compress voice to consume less bandwidth. However, compressing voice can degrade the quality of the voice. The loss of voice quality could cause issues when sending tones (for example, Dual Tone Multifrequency [DTMF] tones, modem tones, and fax tones) across a network.

Some voice processes require significant processor overhead. For example, if your voice network requires you to convert between a high-bandwidth codec and a low-bandwidth codec, you need to perform transcoding. The act of transcoding requires dedicated hardware called digital signal processors (DSPs). This topic addresses voice quality issues, offers voice quality solutions, and describes VoIP hardware components (for example, codecs and DSPs).

The inherent characteristics of a converged voice and data IP network cause network engineers and administrators to face certain challenges in delivering voice traffic correctly. This section describes the challenges of integrating a voice and data network and offers solutions for avoiding problems when designing a VoIP network for optimal voice quality.

IP Networking and Audio Clarity

Because of the nature of IP networking, voice packets sent via IP are subject to certain transmission problems. Conditions present in the network might introduce problems such as echo, jitter, or delay. These problems must be addressed with quality of service (QoS) mechanisms.

The clarity (that is, the "cleanliness" and "crispness") of the audio signal is of utmost importance. The listener must be able to recognize the identity and sense the mood of the speaker. The following factors can affect clarity:

■ Fidelity: Fidelity is the degree to which a system, or a portion of a system, accurately reproduces at its output the essential characteristics of the signal impressed upon its input, or the result of a prescribed operation on the signal impressed upon its input (definition from the Alliance for Telecommunications Industry Solutions [ATIS]). The bandwidth of the transmission medium almost always limits the total bandwidth of the spoken voice. Human speech typically requires a bandwidth from 100 to 10,000 Hz, although 90 percent of speech intelligence is contained between 100 and 3000 Hz.

■ Echo: Echo is a result of electrical impedance mismatches in the transmission path. Echo is always present, even in traditional telephony networks, but at a level that cannot be detected by the human ear. The two components that affect echo are amplitude (loudness of the echo) and delay (the time between the spoken voice and the echoed sound). You can control echo using suppressors or cancellers.

■ Jitter: Jitter is variation in the arrival of coded speech packets at the far end of a VoIP network. The varying arrival time of the packets can cause gaps in the recreation and playback of the voice signal. These gaps are undesirable and annoy the listener. Variable delay is induced in the network by variation in the routes of individual packets, contention, or congestion. You can compensate for variable delay by using dejitter buffers.

■ Delay: Delay is the time between the spoken voice and the arrival of the electronically delivered voice at the far end. Delay results from multiple factors, including distance (propagation delay), coding, compression, serialization, and buffers.

■ Packet Loss: Voice packets might be dropped under various conditions such as an unstable network, network congestion, or too much variable delay in the network. Lost voice packets are not recoverable, resulting in gaps in the conversation that are perceptible to the user.

■ Side tone: Side tone is the purposeful design of the telephone that allows the speakers to hear their spoken audio in the earpiece. Without side tone, the speaker is left with the impression that the telephone instrument is not working.

■ Background noise: Background noise is the low-volume audio that is heard from the far-end connection. Certain bandwidth-saving technologies, such as voice activity detection (VAD), can eliminate background noise altogether. When this technology is implemented, the speaker audio path is open to the listener, while the listener audio path is closed to the speaker. The effect of VAD is often that speakers think the connection is broken because they hear nothing from the other end. Therefore, VAD is often combined with comfort noise generation (CNG) to prevent the illusion that the call has been disconnected.
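The interplay of VAD and CNG described in the last item can be sketched in a few lines of Python. This is only a toy model: the energy threshold, the noise level, the frame length, and the function names (`apply_vad`, `apply_cng`) are assumptions made for illustration, and real VAD and CNG implementations are codec-specific and considerably more sophisticated.

```python
import math
import random


def rms(frame):
    """Root-mean-square level of one frame of audio samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def apply_vad(frames, threshold=0.02):
    """Sender side: frames below the (assumed) energy threshold are not sent.

    A suppressed frame is represented by None.
    """
    return [frame if rms(frame) >= threshold else None for frame in frames]


def apply_cng(received, frame_len, noise_level=0.005, seed=0):
    """Receiver side: fill suppressed frames with low-level random noise."""
    rng = random.Random(seed)
    output = []
    for frame in received:
        if frame is None:
            frame = [rng.uniform(-noise_level, noise_level) for _ in range(frame_len)]
        output.append(frame)
    return output


# Hypothetical sample values: two speech frames around one near-silent frame.
speech = [0.30, -0.25, 0.28, -0.31]
silence = [0.001, -0.002, 0.001, 0.0]

sent = apply_vad([speech, silence, speech])
heard = apply_cng(sent, frame_len=4)
print([frame is not None for frame in sent])  # which frames were transmitted
```

In this model the near-silent frame consumes no bandwidth, yet the listener still hears faint noise rather than dead silence during the pause.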

The following sections cover some of these in more detail.


Jitter

Jitter is defined as a variation in the arrival of received packets. On the sending side, packets are sent in a continuous stream with the packets spaced evenly. Because of network congestion, improper queuing, or configuration errors, this steady stream can become uneven: the delay between packets varies instead of remaining constant, as displayed in Figure 2-1.

Figure 2-1 Jitter in IP Networks

When a router receives an audio stream for VoIP, it must compensate for the jitter that is encountered. The mechanism that handles this function is the playout delay buffer, or dejitter buffer. The playout delay buffer must buffer these packets and then play them out in a steady stream to the DSPs to be converted back to an analog audio stream. The playout delay buffer, however, adds to the overall absolute delay.

When a conversation is subjected to jitter, the results can be clearly heard. If the talker says, "Watson, come here. I want you," the listener might hear, "Wat….s…on…….come here, I……wa……nt……..y……ou." The variable arrival of the packets at the receiving end causes the speech to be delayed and garbled.
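As a rough illustration of the ideas in this section, the following Python sketch measures the variation in packet arrival and models a fixed-size playout buffer. The function names, the 20 ms send interval, and the sample arrival times are assumptions made for illustration; they do not describe how any particular router implements its dejitter buffer.

```python
def interarrival_jitter(arrivals_ms, send_interval_ms=20):
    """Return how far each inter-arrival gap deviates from the send interval."""
    gaps = [b - a for a, b in zip(arrivals_ms, arrivals_ms[1:])]
    return [gap - send_interval_ms for gap in gaps]


def playout_schedule(arrivals_ms, buffer_ms, send_interval_ms=20):
    """Model a fixed dejitter buffer.

    Packet i is scheduled for playout at
    first_arrival + buffer_ms + i * send_interval_ms.
    A packet that arrives after its playout time is too late and is discarded.
    Returns a list of (playout_time_ms, played) tuples.
    """
    start = arrivals_ms[0] + buffer_ms
    schedule = []
    for i, arrival in enumerate(arrivals_ms):
        playout = start + i * send_interval_ms
        schedule.append((playout, arrival <= playout))
    return schedule


# Hypothetical arrival times (ms) for packets that were sent every 20 ms.
arrivals = [0, 22, 39, 75, 81, 100]

print(interarrival_jitter(arrivals))
print(playout_schedule(arrivals, buffer_ms=10))
print(playout_schedule(arrivals, buffer_ms=20))
```

With these sample values, a 10 ms buffer discards the packet that arrives at 75 ms, whereas a 20 ms buffer plays every packet at the cost of 10 ms more absolute delay. That is the trade-off the playout delay buffer imposes.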


Delay

Overall or absolute delay can affect VoIP. You might have experienced delay in a telephone conversation with someone on a different continent. The delays can cause entire words in the conversation to be cut off and can therefore be very frustrating. Figure 2-2 illustrates various areas in the network that can introduce delay.

Figure 2-2 Sources of Delay

When you design a network that transports voice over packet, frame, or cell infrastructures, it is important to understand and account for the predictable delay components in the network. You must also correctly account for all potential delays to ensure overall network performance is acceptable. Overall voice quality is a function of many factors, including the compression algorithm, errors and frame loss, echo cancellation, and delay.

Following are the two distinct types of delay:

■ Fixed delay: Fixed-delay components are predictable and add directly to overall delay on the connection. Fixed-delay components include the following:

  ■ Coding: The time it takes to translate the audio signal into a digital signal

  ■ Packetization: The time it takes to put digital voice information into packets and remove the information from packets

  ■ Serialization: The insertion of bits onto a link

  ■ Propagation: The time it takes a packet to traverse a link

■ Variable delay: Variable delays arise from queuing delays in the egress trunk buffers that are located on the serial port connected to the WAN. These buffers create variable delays, called jitter, across the network.
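Two of the fixed-delay components listed above follow directly from simple arithmetic, which the Python sketch below works through. The 60-byte frame, the 4000 km path, and the propagation speed of 200,000 km/s (roughly two-thirds of the speed of light) are assumed values for illustration only.

```python
def serialization_delay_ms(frame_bytes, link_bps):
    """Time to clock every bit of a frame onto the link, in milliseconds."""
    return frame_bytes * 8 * 1000 / link_bps


def propagation_delay_ms(distance_km, speed_km_per_s=200_000):
    """Time for a signal to traverse the link, in milliseconds.

    The default speed is an assumed figure, not a measured one.
    """
    return distance_km * 1000 / speed_km_per_s


# Hypothetical inputs: a 60-byte frame on a 64 kbps link, and a 4000 km path.
print(serialization_delay_ms(60, 64_000))
print(propagation_delay_ms(4_000))
```

Serialization delay shrinks as link speed rises, whereas propagation delay depends only on distance, which is why it cannot be engineered away on long-haul paths.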

Acceptable Delay

The International Telecommunication Union Telecommunication Standardization Sector (ITU-T) specifies network delay for voice applications in Recommendation G.114. This recommendation defines three bands of one-way delay, as shown in Table 2-1.

Table 2-1 Acceptable Delay: G.114

Range in Milliseconds | Description
0 to 150 | Acceptable for most user applications.
150 to 400 | Acceptable, provided administrators are aware of the transmission time and its impact on the transmission quality of user applications.
Above 400 | Unacceptable for general network planning purposes. (However, it is recognized that in some exceptional cases, this limit will be exceeded.)

Note: This recommendation is for connections in which echo is adequately controlled, implying that echo cancellers are used. Echo cancellers are required when one-way delay exceeds 25 ms (G.131).

The G.114 recommendation is oriented toward national telecommunications administrations and, therefore, is more stringent than recommendations that would normally be applied in private voice networks. When the location and business needs of end users are well known to a network designer, more delay might prove acceptable. For private networks, a 200 ms delay is a reasonable goal and a 250 ms delay is a limit. This goal is what Cisco Systems proposes as reasonable as long as excessive jitter does not affect voice quality. However, all networks must be engineered so the maximum expected voice connection delay is known and minimized.
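The G.114 bands and the private-network targets mentioned above can be expressed as a small lookup, sketched here in Python. The function names are hypothetical, and because the table leaves the boundary values ambiguous, the sketch assumes that exactly 150 ms and exactly 400 ms belong to the lower band.

```python
def classify_one_way_delay(delay_ms):
    """Map a one-way delay to one of the three G.114 bands."""
    if delay_ms < 0:
        raise ValueError("delay cannot be negative")
    if delay_ms <= 150:  # assumption: boundary belongs to the lower band
        return "acceptable for most user applications"
    if delay_ms <= 400:  # assumption: boundary belongs to the lower band
        return "acceptable if the impact on quality is understood"
    return "unacceptable for general network planning"


def check_private_network(delay_ms, goal_ms=200, limit_ms=250):
    """Compare a one-way delay with the private-network goal and limit."""
    if delay_ms <= goal_ms:
        return "within goal"
    if delay_ms <= limit_ms:
        return "within limit"
    return "exceeds limit"


for delay in (120, 180, 230, 450):
    print(delay, classify_one_way_delay(delay), check_private_network(delay), sep=" | ")
```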

Calculating Delay Budget

The G.114 recommendation is for one-way delay only and does not account for round-trip delay. Network design engineers must consider both variable and fixed delays. Variable delays include queuing and network delays, and fixed delays include coding, packetization, serialization, and dejitter buffer delays. Table 2-2 offers a sample delay budget calculation.

Table 2-2 Delay Budget Calculations

Delay Type | Fixed (ms) | Variable (ms)
Coder delay | |
Packetization delay | |
Queuing and buffering | |
Serialization (64 kbps) | |
Network delay (public frame) | |
Dejitter buffer | |
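The Python sketch below shows how a delay budget of this kind could be tallied. Every millisecond value in it is a hypothetical placeholder chosen for illustration, not a figure from Table 2-2, and the `DelayComponent` class and `total_budget` function are assumed names. The split between fixed and variable components follows the description given earlier in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DelayComponent:
    """One row of a delay budget; a component can be fixed, variable, or both."""
    name: str
    fixed_ms: float = 0.0
    variable_ms: float = 0.0


def total_budget(components):
    """Return (fixed total, variable total, overall one-way total) in ms."""
    fixed = sum(c.fixed_ms for c in components)
    variable = sum(c.variable_ms for c in components)
    return fixed, variable, fixed + variable


# Hypothetical values, for illustration only.
budget = [
    DelayComponent("Coder delay", fixed_ms=20),
    DelayComponent("Packetization delay", fixed_ms=30),
    DelayComponent("Queuing and buffering", variable_ms=10),
    DelayComponent("Serialization (64 kbps)", fixed_ms=5),
    DelayComponent("Network delay", variable_ms=25),
    DelayComponent("Dejitter buffer", fixed_ms=40),
]

fixed, variable, total = total_budget(budget)
print(f"fixed={fixed} ms, variable={variable} ms, total={total} ms")
print("within the 150 ms band:", total <= 150)
```

Because the recommendation covers one-way delay, the total here is a one-way figure; a round-trip estimate would need the budget for the return path as well.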





Packet Loss

Lost data packets are recoverable if the endpoints can request retransmission. Lost voice packets, as depicted in Figure 2-3, are not recoverable, because the audio must be played out in real time and retransmission is not an option.

Figure 2-3 Effect of Packet Loss

Voice packets might be dropped under the following conditions:

■ The network is unstable (flapping links).

■ The network is congested.

■ Too much variable delay exists in the network, because packets might arrive too late to be admitted into an interface’s dejitter buffer.

Packet loss causes voice clipping and skips. As a result, the listener hears gaps in the conversation, as shown in Figure 2-3. The industry-standard codec algorithms that are used in Cisco DSPs correct for 20 ms to 50 ms of lost voice through the use of Packet Loss Concealment (PLC) algorithms. PLC analyzes the missing packets and generates a reasonable replacement packet to improve the voice quality. Cisco VoIP technology uses 20 ms samples of voice payload per VoIP packet by default. For the codec correction algorithms to be effective, only a single packet can be lost at any given time. If more packets are lost, the listener experiences gaps.

If a conversation experiences packet loss, the effect is immediately heard. If the talker says, "Watson, come here. I want you," the listener might hear, "Wat–, come here, —you."
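The difference between a single lost packet and a burst of losses can be illustrated with a deliberately simplified concealment routine in Python. Repeating the last good payload is an assumed stand-in for PLC, chosen only because it is easy to follow; actual PLC algorithms are codec-specific. The function name, the `max_concealed` parameter, and the frame labels are hypothetical.

```python
def conceal_losses(packets, max_concealed=1):
    """Replace lost packets (None) with a copy of the last good payload.

    At most max_concealed consecutive losses are concealed. Any further
    losses in the same burst stay None, which the listener hears as a gap.
    """
    output = []
    last_good = None
    run = 0  # length of the current burst of lost packets
    for payload in packets:
        if payload is not None:
            last_good = payload
            run = 0
            output.append(payload)
        else:
            run += 1
            if last_good is not None and run <= max_concealed:
                output.append(last_good)
            else:
                output.append(None)
    return output


# Hypothetical stream of 20 ms frames: one isolated loss, then a burst of two.
stream = ["f0", "f1", None, "f3", "f4", None, None, "f7"]
print(conceal_losses(stream))
```

The isolated loss is filled in, but the second packet of the two-packet burst remains a gap, which matches the single-packet condition described above.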
