Error Control (Data Communications and Networking)

Before learning the control mechanisms that can be implemented to protect a network from errors, you should realize that there are human errors and network errors. Human errors, such as a mistake in typing a number, usually are controlled through the application program. Network errors, such as those that occur during transmission, are controlled by the network hardware and software.

There are two categories of network errors: corrupted data (data that have been changed) and lost data. Networks should be designed to (1) prevent, (2) detect, and (3) correct both corrupted data and lost data. We begin by examining the sources of errors and how to prevent them and then turn to error detection and correction.

Network errors are a fact of life in data communications networks. Depending on the type of circuit, they may occur every few hours, minutes, or seconds because of noise on the lines. No network can eliminate all errors, but most errors can be prevented, detected, and corrected by proper design. IXCs that provide data transmission circuits provide statistical measures specifying typical error rates and the pattern of errors that can be expected on the circuits they lease. For example, the error rate might be stated as 1 in 500,000, meaning there is 1 bit in error for every 500,000 bits transmitted.

Normally, errors appear in bursts. In a burst error, more than 1 data bit is changed by the error-causing condition. In other words, errors are not uniformly distributed in time. Although an error rate might be stated as 1 in 500,000, errors are more likely to occur as 100 bits every 50,000,000 bits. The fact that errors tend to be clustered in bursts rather than evenly dispersed is both good and bad. If the errors were not clustered, an error rate of 1 bit in 500,000 would make it rare for 2 erroneous bits to occur in the same character. Consequently, simple character-checking schemes would be effective at detecting errors.But burst errors are the rule rather than the exception, often obliterating 100 or more bits at a time. This makes it more difficult to recover the meaning, so more reliance must be placed on error detection and correction methods. The positive side is that there are long periods of error-free transmission, meaning that very few messages encounter errors.

Sources of Errors

Line noise and distortion can cause data communication errors. The focus in this section is on electrical media such as twisted-pair wire and coaxial cable, because they are more likely to suffer from noise than are optical media such as fiber-optic cable. In this case, noise is undesirable electrical signals (for fiber-optic cable, it is undesirable light). Noise is introduced by equipment or natural disturbances, and it degrades the performance of a communication circuit. Noise manifests itself as extra bits, missing bits, or bits that have been "flipped" (i.e., changed from 1 to 0 or vice versa). Figure 4.2 summarizes the major sources of error and ways to prevent them. The first six sources listed there are the most important; the last three are more common in analog rather than digital circuits.

Line outages are a catastrophic cause of errors and incomplete transmission. Occasionally, a communication circuit fails for a brief period. This type of failure may be caused by faulty telephone end office equipment, storms, loss of the carrier signal, and any other failure that causes a short circuit. The most common cause of line outages are storms that cause damage to circuits or facilities.

White noise or Gaussian noise (the familiar background hiss or static on radios and telephones) is caused by the thermal agitation of electrons and therefore is inescapable.

Source of Error	What Causes It	How to Prevent It
Line outages	Storms, accidents
White noise	Movement of electrons	Increase signal strength
Impulse noise	Sudden increases in electricity (e.g., lightning)	Shield or move the wires
Cross-talk	Multiplexer guardbands too small or wires too close together	Increase the guardbands or move or shield the wires
Echo	Poor connections	Fix the connections or tune equipment
Attenuation	Gradual decrease in signal over distance	Use repeaters or amplifiers
Intermodulation noise	Signals from several circuits combine	Move or shield the wires
Jitter	Analog signals change phase	Tune equipment
Harmonic distortion	Amplifier changes phase	Tune equipment

Figure 4.2 Sources of errors and ways to minimize them

Even if the equipment were perfect and the wires were perfectly insulated from any and all external interference, there still would be some white noise. White noise usually is not a problem unless it becomes so strong that it obliterates the transmission. In this case, the strength of the electrical signal is increased so it overpowers the white noise; in technical terms, we increase the signal-to-noise ratio.

Impulse noise (sometimes called spikes) is the primary source of errors in data communications. Impulse noise is heard as a click or a crackling noise and can last as long as V100 of a second. Such a click does not really affect voice communications, but it can obliterate a group of data, causing a burst error. At 1.5 Mbps, 15,000 bits would be changed by a spike of V100 of a second. Some of the sources of impulse noise are voltage changes in adjacent lines, lightning flashes during thunderstorms, fluorescent lights, and poor connections in circuits.

Cross-talk occurs when one circuit picks up signals in another. You experience cross-talk during telephone calls when you hear other conversations in the background. It occurs between pairs of wires that are carrying separate signals, in multiplexed links carrying many discrete signals, or in microwave links in which one antenna picks up a minute reflection from another antenna. Cross-talk between lines increases with increased communication distance, increased proximity of the two wires, increased signal strength, and higher-frequency signals. Wet or damp weather can also increase cross-talk. Like white noise, cross-talk has such a low signal strength that it normally is not bothersome.

Echoes can cause errors. Echoes are caused by poor connections that cause the signal to reflect back to the transmitting equipment. If the strength of the echo is strong enough to be detected, it causes errors. Echoes, like cross-talk and white noise, have such a low signal strength that they normally are not bothersome. Echoes can also occur in fiber-optic cables when connections between cables are not properly aligned.

Attenuation is the loss of power a signal suffers as it travels from the transmitting computer to the receiving computer. Some power is absorbed by the medium or is lost before it reaches the receiver. As the medium absorbs power, the signal becomes weaker, and the receiving equipment has less and less chance of correctly interpreting the data.

This power loss is a function of the transmission method and circuit medium. High frequencies lose power more rapidly than do low frequencies during transmission, so the received signal can thus be distorted by unequal loss of its component frequencies. Attenuation increases as frequency increases or as the diameter of the wire decreases.

Intermodulation noise is a special type of cross-talk. The signals from two circuits combine to form a new signal that falls into a frequency band reserved for another signal. This type of noise is similar to harmonics in music. On a multiplexed line, many different signals are amplified together, and slight variations in the adjustment of the equipment can cause intermodulation noise. A maladjusted modem may transmit a strong frequency tone when not transmitting data, thus producing this type of noise.

Jitter may affect the accuracy of the data being transmitted because minute variations in amplitude, phase, and frequency always occur. The generation of a pure carrier signal in an analog circuit is impossible. The signal may be impaired by continuous and rapid gain and/or phase changes. This jitter may be random or periodic. Phase jitter during a telephone call causes the voice to fluctuate in volume.

Harmonic distortion usually is caused by an amplifier on a circuit that does not correctly represent its output with what was delivered to it on the input side. Phase hits are short-term shifts "out of phase," with the possibility of a shift back into phase.

In general, errors are more likely to occur in wireless, microwove, or satellite transmission than transmission through cables. Therefore, error detection is more important when using radiated media than guided media.

Error Prevention

There are many techniques to prevent errors (or at least reduce them), depending on the situation. Shielding (protecting wires by covering them with an insulating coating) is one of the best ways to prevent impulse noise, cross-talk, and intermodulation noise. Many different types of wires and cables are available with different amounts of shielding. In general, the greater the shielding, the more expensive the cable and the more difficult it is to install.

Moving cables away from sources of noise (especially power sources) can also reduce impulse noise, cross-talk, and intermodulation noise. For impulse noise, this means avoiding lights and heavy machinery. Locating communication cables away from power cables is always a good idea. For cross-talk, this means physically separating the cables from other communication cables.

Cross-talk and intermodulation noise is often caused by improper multiplexing. Changing multiplexing techniques (e.g., from FDM to TDM) or changing the frequencies or size of the guardbands in FDM can help.

Many types of noise (e.g., echoes, white noise, jitter, harmonic distortion) can be caused by poorly maintained equipment or poor connections and splices among cables. This is particularly true for echo in fiber-optic cables, which is almost always caused by poor connections. The solution here is obvious: Tune the transmission equipment and redo the connections.

To avoid attenuation, telephone circuits have repeaters or amplifiers spaced throughout their length. The distance between them depends on the amount of power lost per unit length of the transmission line. An amplifier takes the incoming signal, increases its strength, and retransmits it on the next section of the circuit. They are typically used on analog circuits such as the telephone company’s voice circuits. The distance between the amplifiers depends on the amount of attenuation, although 1- to 10-mile intervals are common. On analog circuits, it is important to recognize that the noise and distortion are also amplified, along with the signal. This means some noise from a previous circuit is regenerated and amplified each time the signal is amplified.

Finding the Source of Impulse Noise

MANAGEMENT FOCUS

Several years ago, the University of Georgia radio station received FCC (Federal Communications Commission) approval to broadcast using a stronger signal. Immediately after the station started broadcasting with the new signal, the campus backbone network (BN) became unusable because of impulse noise. It took two days to link the impulse noise to the radio station, and when the radio station returned to its usual broadcast signal, the problem disappeared.

However, this was only the first step in the problem. The radio station wanted to broadcast at full strength, and there was no good reason for why the stronger broadcast should affect the BN in this way. After two weeks of effort, the problem was discovered. A short section of the BN ran above ground between two buildings. It turned out that the specific brand of outdoor cable we used was particularly tasty to squirrels. They had eaten the outer insulating coating off of the cable, making it act like an antennae to receive the radio signals. The cable was replaced with a steel-coated armored cable so the squirrels could not eat the insulation. Things worked fine when the radio station returned to its stronger signal.

Repeaters are commonly used on digital circuits. A repeater receives the incoming signal, translates it into a digital message, and retransmits the message. Because the message is recreated at each repeater, noise and distortion from the previous circuit are not amplified. This provides a much cleaner signal and results in a lower error rate for digital circuits.

Error Detection

It is possible to develop data transmission methodologies that give very high error detection performance. The only way to do error detection is to send extra data with each message. These error detection data are added to each message by the data link layer of the sender on the basis of some mathematical calculations performed on the message (in some cases, error-detection methods are built into the hardware itself). The receiver performs the same mathematical calculations on the message it receives and matches its results against the error-detection data that were transmitted with the message. If the two match, the message is assumed to be correct. If they don’t match, an error has occurred.

In general, the larger the amount of error-detection data sent, the greater the ability to detect an error. However, as the amount of error-detection data is increased, the throughput of useful data is reduced, because more of the available capacity is used to transmit these error-detection data and less is used to transmit the actual message itself.

Figure 4.3 Using parity for error detection

Therefore, the efficiency of data throughput varies inversely as the desired amount of error detection is increased.

Three well-known error-detection methods are parity checking, checksum, and cyclic redundancy checking.

Parity Checking One of the oldest and simplest error-detection methods is parity. With this technique, one additional bit is added to each byte in the message. The value of this additional parity bit is based on the number of 1′s in each byte transmitted. This parity bit is set to make the total number of 1′s in the byte (including the parity bit) either an even number or an odd number. Figure 4.3 gives an example.

A little thought will convince you that any single error (a switch of a 1 to a 0 or vice versa) will be detected by parity, but it cannot determine which bit was in error. You will know an error occurred, but not what the error was. But if two bits are switched, the parity check will not detect any error. It is easy to see that parity can detect errors only when an odd number of bits have been switched; any even number of errors cancel one another out. Therefore, the probability of detecting an error, given that one has occurred, is only about 50 percent. Many networks today do not use parity because of its low error-detection rate. When parity is used, protocols are described as having odd parity or even parity.

Checksum With the checksum technique, a checksum (typically one byte) is added to the end of the message. The checksum is calculated by adding the decimal value of each character in the message, dividing the sum by 255, and using the remainder as the checksum. The receiver calculates its own checksum in the same way and compares it with the transmitted checksum. If the two values are equal, the message is presumed to contain no errors. Use of checksum detects close to 95 percent of the errors for multiple-bit burst errors.

Cyclical Redundancy Check One of the most popular error-checking schemes is cyclical redundancy check (CRC). It adds 8, 16, 24, or 32 bits to the message. With CRC, a message is treated as one long binary number, P. Before transmission, the data link layer (or hardware device) divides P by a fixed binary number, G, resulting in a whole number, Q, and a remainder, R/G. So, P/G = Q + R/G. For example, if P = 58 and G = 8, then Q = 7 and R = 2. G is chosen so that the remainder, R, will be either 8 bits, 16 bits, 24 bits, or 32 bits.1

The remainder, R, is appended to the message as the error-checking characters before transmission. The receiving hardware divides the received message by the same G, which generates an R. The receiving hardware checks to ascertain whether the received R agrees with the locally generated R. If it does not, the message is assumed to be in error.

CRC performs quite well. The most commonly used CRC codes are CRC-16 (a 16-bit version), CRC-CCITT (another 16-bit version), and CRC-32 (a 32-bit version). The probability of detecting an error is 100 percent for all errors of the same length as the CRC or less. For example, CRC-16 is guaranteed to detect errors if 16 or fewer bits are affected. If the burst error is longer than the CRC, then CRC is not perfect but is close to it. CRC-16 will detect about 99.998 percent of all burst errors longer than 16 bits, whereas CRC-32 will detect about 99.99999998 percent of all burst errors longer than 32 bits.

Error Correction via Retransmission

Once error has been detected, it must be corrected. The simplest, most effective, least expensive, and most commonly used method for error correction is retransmission. With retransmission, a receiver that detects an error simply asks the sender to retransmit the message until it is received without error. This is often called Automatic Repeat reQuest (ARQ). There are two types of ARQ: stop-and-wait and continuous.

Stop-and-Wait ARQ With stop-and-wait ARQ, the sender stops and waits for a response from the receiver after each data packet. After receiving a packet, the receiver sends either an acknowledgment (ACK), if the packet was received without error, or a negative acknowledgment (NAK), if the message contained an error. If it is an NAK, the sender resends the previous message. If it is an ACK, the sender continues with the next message. Stop-and-wait ARQ is by definition a half-duplex transmission technique (Figure 4.4).

Continuous ARQ With continuous ARQ, the sender does not wait for an acknowledgment after sending a message; it immediately sends the next one. Although the messages are being transmitted, the sender examines the stream of returning acknowledgments. If it receives an NAK, the sender retransmits the needed messages. The packets that are retransmitted may be only those containing an error (called Link Access Protocol for Modems [LAP-M]) or may be the first packet with an error and all those that followed it (called Go-Back-N ARQ). LAP-M is better because it is more efficient.

Continuous ARQ is by definition a full-duplex transmission technique, because both the sender and the receiver are transmitting simultaneously. (The sender is sending messages, and the receiver is sending ACKs and NAKs.) Figure 4.5 illustrates the flow of messages on a communication circuit using continuous ARQ. Continuous ARQ is sometimes called sliding window because of the visual imagery the early network designers used to think about continuous ARQ.

Figure 4.4 Stop-and-wait ARQ (Automatic Repeat request). ACK = acknowledgment; NAK = negative acknowledgment

Figure 4.5 Continuous ARQ (Automatic Repeat request). ACK = acknowledgment; NAK = negative acknowledgment

Visualize the sender having a set of messages to send in memory stacked in order from first to last. Now imagine a window that moves through the stack from first to last. As a message is sent, the window expands to cover it, meaning that the sender is waiting for an ACK for the message. As an ACK is received for a message, the window moves forward, dropping the message out of the bottom of the window, indicating that it has been sent and received successfully.

Continuous ARQ is also important in providing flow control, which means ensuring that the computer sending the message is not transmitting too quickly for the receiver. For example, if a client computer was sending information too quickly for a server computer to store a file being uploaded, the server might run out of memory to store the file. By using ACKs and NAKs, the receiver can control the rate at which it receives information. With stop-and-wait ARQ, the receiver does not send an ACK until it is ready to receive more packets. In continuous ARQ, the sender and receiver usually agree on the size of the sliding window. Once the sender has transmitted the maximum number of packets permitted in the sliding window, it cannot send any more packets until the receiver sends an ACK.

Forward Error Correction

Forward error correction uses codes containing sufficient redundancy to prevent errors by detecting and correcting them at the receiving end without retransmission of the original message. The redundancy, or extra bits required, varies with different schemes. It ranges from a small percentage of extra bits to 100 percent redundancy, with the number of error-detecting bits roughly equaling the number of data bits. One of the characteristics of many error-correcting codes is that there must be a minimum number of error-free bits between bursts of errors.

Forward error correction is commonly used in satellite transmission. A round trip from the earth station to the satellite and back includes a significant delay. Error rates can fluctuate depending on the condition of equipment, sunspots, or the weather. Indeed, some weather conditions make it impossible to transmit without some errors, making forward error correction essential. Compared with satellite equipment costs, the additional cost of forward error correction is insignificant.

Error Control in Practice

In the OSI model, error control is defined to be a layer-2 function—it is the responsibility of the data link layer. However, in practice, we have moved away from this. Most network hardware and software available today provide an error control function at the data link layer, but it is turned off. Most network cables are very reliable and errors are far less common than they were in the 1980s.

Therefore, most data link layer software today is configured to detect errors, but not correct them. Any time a packet with an error is discovered, it is simply discarded. The exceptions to this tend to be wireless technologies and a few WAN technologies where errors are more common.

The implication from this is that error correction must be performed by software at higher layers. This software must be able to detect lost packets (i.e., those that have been discarded) and request the sender to retransmit them. This is commonly done by the transport layer using continuous ARQ as we shall see in the next topic.

How Forward Error Correction Works

TECHNICAL FOCUS

To see how error-correcting codes work, consider the example of a forward error checking code in Figure 4.6, called a Hamming code, after its inventor, R. W. Hamming. This code is a very simple approach, capable of correcting 1-bit errors. More sophisticated techniques (e.g., Reed-Solomon) are commonly used today, but this will give you a sense of how they work.

The Hamming code associates even parity bits with unique combinations of data bits. With a 4-data-bit code as an example, a character might be represented by the data-bit configuration 1010. Three parity bits, P1, P2, and P4, are added, resulting in a 7-bit code, shown in the upper half of Figure 4.6. Notice that the data bits (D3, D5, D6, D7) are 1010 and the parity bits (P1, P2, P4) are 101.

As depicted in the upper half of Figure 4.6, parity bit P1 applies to data bits D3, D5, and D7. Parity bit P2 applies to data bits D3, D6, and D7. Parity bit P4 applies to data bits D5, D6, and D7. For the example, in which D3, D5, D6, D7 = 1010, P1 must equal 1 because there is only a single 1 among D3, D5 and D7 and parity must be even. Similarly, P2 must be 0 because D3 and D6 are 1′s. P4 is 1 because D6 is the only 1 among D5, D6, and D7.

Now, assume that during the transmission, data bit D7 is changed from a 0 to a 1 by line noise. Because this data bit is being checked by P1, P2, and P4, all 3 parity bits now show odd parity instead of the correct even parity. D7 is the only data bit that is monitored by all 3 parity bits; therefore, when D7 is in error, all 3 parity bits show an incorrect parity. In this way, the receiving equipment can determine which bit was in error and reverse its state, thus correcting the error without retransmission.

The lower half of the figure is a table that determines the location of the bit in error. A 1 in the table means that the corresponding parity bit indicates a parity error. Conversely, a 0 means the parity check is correct. These 0′s and 1′s form a binary number that indicates the numeric location of the erroneous bit. In the previous example, P1, P2, and P4 checks all failed, yielding 111, or a decimal 7, the subscript of the erroneous bit.