The Impact of Digital Television and HDTV (Display Interfaces) Part 2

Video Compression Basics

In short, it was the availability of practical and relatively low-cost digital compression and decompression hardware that made digital HDTV possible. This same capability can also be exploited in another way, and has been in commercial systems – by reducing the channel capacity required for a given transmission, digital compression also permits multiple standard-definition transmissions, along with additional data, to be broadcast in a single standard channel.

Consider one frame in one standard “HD” format – a 1920 x 1080, 2:1 interlaced transmission. If we assume an RGB representation at 8 bits per color, each frame contains almost 50 million bits of data; transmitting this at a 60 Hz frame rate would require a sustained data rate of almost 375 Mbytes/s. Clearly, the interlaced scanning format will help here, reducing the rate by a factor of two. We can also change to a more efficient representation: for example, a YCrCb signal set, with 8 bits per sample of each, to which we then apply the 4:2:2 subsampling described above. This would reduce the required rate by another third, to approximately 124 Mbytes/s, or just under 1 Gbit/s.
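The arithmetic behind these figures is easy to verify; the short Python sketch below reproduces them using only the assumptions just stated:

```python
# Back-of-the-envelope data rates for 1920 x 1080 video, as described above.

PIXELS = 1920 * 1080                # pixels per frame
RGB_BITS = 3 * 8                    # 8 bits per color, RGB

bits_per_frame = PIXELS * RGB_BITS
print(f"bits per frame: {bits_per_frame:,}")        # ~49.8 million

raw_rate = bits_per_frame * 60 / 8 / 1e6            # 60 Hz rate, Mbytes/s
print(f"raw 60 Hz rate: {raw_rate:.0f} Mbytes/s")   # ~373 Mbytes/s

interlaced = raw_rate / 2                           # 2:1 interlace halves it
ycbcr_422 = interlaced * 16 / 24                    # 4:2:2 averages 16 bits/pixel
print(f"interlaced 4:2:2 rate: {ycbcr_422:.0f} Mbytes/s "
      f"({ycbcr_422 * 8 / 1000:.2f} Gbit/s)")       # ~124 Mbytes/s, ~1 Gbit/s
```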

But by Shannon’s theorem for the data capacity of a bandlimited, noisy channel, a 6 MHz TV channel is capable of carrying not more than about 20 Mbit/s. (This assumes a signal-to-noise ratio of 10 dB, not an unreasonable limit for television broadcast.) Transmitting the HDTV signal via this channel, as was the intention announced by the FCC in calling for an all-digital system, will require a further reduction or compression of the data by a factor of about 50:1! (Note that this can also be expressed as requiring that the transmitted data stream correspond to less than one bit per pixel of the original image.) Digital television, and especially digital HDTV, requires the use of sophisticated compression techniques.
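Shannon’s formula, C = B log2(1 + S/N), makes this limit easy to check against the figures just given:

```python
import math

B = 6e6                          # one standard TV channel bandwidth, Hz
snr = 10 ** (10 / 10)            # 10 dB signal-to-noise ratio, as a power ratio

capacity = B * math.log2(1 + snr)                   # Shannon limit, bits/s
print(f"channel capacity: {capacity / 1e6:.1f} Mbit/s")       # about 20.8

hdtv_rate = 995e6                # the ~1 Gbit/s stream derived above
print(f"required compression: {hdtv_rate / capacity:.0f}:1")  # about 48:1
```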


There are many different techniques used to compress video transmissions. However, a review of the basics of compression, along with some specific information on how digital video data is compressed under the current DTV and HDTV standards, is needed here.

Recall that compression techniques may be broadly divided into two categories: lossless and lossy, depending on whether or not the original data can be completely recovered from the compressed form of the information (assuming no losses due to noise in the transmission process itself). Lossless compression is possible only when redundancy exists in the original data. The redundant information may be removed without impacting the receiver’s ability to recover the original, although any such process generally increases the sensitivity of the transmission to noise and distortion. (This must be true, since the redundant information is what gives the opportunity for “error correction” of any type.) Lossy compression methods are those which actually remove information from the transmission, relative to the original data, in addition to removing redundancy. In most cases, lossy compression methods are acceptable only when the original transmission can be analyzed, and classes of information within it distinguished according to how important they are to the receiver (or to the end user). For example, many digital audio systems employ a compression method in which information representing sounds at too low a level, relative to other tones that would be expected to “mask” them, is removed from the data stream.

It should be clear at this point that the practice of limiting the bandwidth of the color-difference signals (and subsampling them in the digital case) is an example of such a system. Information relating to color is being lost in order to “fit the signal into the channel”. But this loss is accepted, as it is known that the eye places much more importance on the luminance information and will not miss the high-frequency portions of the color signals. Interlacing has also been cited as an example of a crude compression scheme in the analog domain, one which is lossless for static images but which gives up resolution along the vertical axis for moving objects (among other problems). Other techniques that may be seen as simple types of compression include:

•    Full or partial suppression of sidebands (theoretically lossless; removes redundancy). This is used, as noted, in standard analog television broadcasting.

•    Removal of portions of the transmission corresponding to “idle” periods of the original signal. For example, in a digital system, there is no need to transmit data corresponding to the blanking periods of the original signal, as the receiver can restore these as long as the individual lines/fields are still distinguishable.

•    Run-length encoding of digital data. This is exactly what the name implies; in systems where significantly long runs of a steady value (either 1 or 0) may be expected, transmitting the lengths of these runs rather than the raw data will often save a significant amount of capacity. This is a lossless technique (a minimal sketch of it, and of the variable-length coding described next, follows this list).

•    Variable-length coding (VLC). Given that not all values are equally likely in a given system, a VLC scheme assigns short codes to the most likely of these values and longer codes to the less likely. An excellent example is “Morse” (actually “International”) radio code, in which the patterns of dots and dashes were assigned roughly in accordance with the frequency of the letters in English. Thus, the common letter “E” is represented by a single dot (“.”), while a relatively uncommon letter such as “X” is represented by a longer pattern of dots and dashes (“-..-”).

•    Quantization. While the act of quantization is not necessarily lossy (if the quantization error is significantly smaller than the noise in the original signal), quantization can also be used as a lossy compression scheme. In this case, the loss occurs in the deletion of information that would otherwise correspond to lower-order bits. (The loss, or error, so introduced may be reduced through dithering the values of the remaining bits over multiple samples.)
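To make the run-length and variable-length ideas above concrete, here is a minimal sketch; the function names and the toy VLC codebook are invented here purely for illustration:

```python
from itertools import groupby

def rle_encode(seq):
    """Run-length encode a sequence as (value, run_length) pairs."""
    return [(v, len(list(run))) for v, run in groupby(seq)]

def rle_decode(pairs):
    """Recover the original sequence exactly; RLE is lossless."""
    return [v for v, n in pairs for _ in range(n)]

bits = [0] * 9 + [1, 1] + [0] * 5 + [1]
pairs = rle_encode(bits)
print(pairs)                        # [(0, 9), (1, 2), (0, 5), (1, 1)]
assert rle_decode(pairs) == bits    # round trip is exact

# Variable-length coding, Morse-style: frequent symbols get short codes.
# This toy prefix-free codebook is invented for illustration only.
vlc = {"E": "0", "T": "10", "X": "11011"}
print("".join(vlc[c] for c in "ETE"))   # "0100": common letters stay cheap
```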

In the compression method most commonly used in present-day digital video practice (the “MPEG-2” system; the abbreviation stands for the Moving Picture Experts Group), a combination of several of these techniques, along with a transform intended to reduce the impact of the compression losses, is employed. The remainder of this section examines the details of this system as it is commonly implemented in current DTV/HDTV standards.

The Discrete Cosine Transform (DCT)

The basis for the compression method used in current digital television standards, although not technically a compression method itself, is the Discrete Cosine Transform, or DCT. This might best be seen as a two-dimensional, discrete-sample analog to the familiar Fourier transform, which permits the conversion of time-domain signals to the frequency domain, and vice versa. The two-dimensional DCT, as used here, transforms spatial sample information from a fixed-size two-dimensional array into an identically sized array of coefficients. These coefficients indicate the relative content of the original sample set in terms of discrete spatial frequencies in X and Y. Mathematically, the discrete cosine transform of any N x N two-dimensional set of values f(j,k) (where j and k may have values from 0 to N – 1) is given by

$$F(u,v) = \frac{2}{N}\, C(u)\, C(v) \sum_{j=0}^{N-1} \sum_{k=0}^{N-1} f(j,k)\, \cos\!\left[\frac{(2j+1)u\pi}{2N}\right] \cos\!\left[\frac{(2k+1)v\pi}{2N}\right]$$

where $C(x)$ is defined as $1/\sqrt{2}$ for $x = 0$, and 1 for $x = 1, 2, \ldots, N-1$.

As commonly used in digital television standards, the DCT process operates by dividing the original set of samples – the pixels of the image – into blocks of 8 pixels by 8 lines each. (The 8 x 8 block size is a compromise between the desire for significant compression, the requirement to be able to perform this process at video-suitable rates, and the requirement to maintain an acceptable level of delivered image quality. It will be apparent that these same methods could be applied to differently sized blocks.) These blocks are transformed, via the formula above, into 8 x 8 blocks of DCT coefficients. Note that this is a fully reversible operation, assuming that arbitrary precision can be maintained throughout, and so the DCT itself does not involve a loss of information. (In practical terms, when this is implemented digitally, the calculations must be carried out using a greater number of bits than were present in the original samples, to avoid loss due to truncation or rounding. In the case of 8-bit input values, the coefficients must be allowed at least 12 bits (signed) each for the process to be reversible; with fewer bits, the transform can no longer be reversed without loss.) Note also that the DCT can result in negative values for the coefficients (due to the use of the cosine function).
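The transform is straightforward to transcribe directly from the formula above. The deliberately unoptimized sketch below (real codecs use fast factorizations instead) also demonstrates the round-trip reversibility just discussed:

```python
import math

N = 8  # block size used in the DTV standards

def C(x):
    return 1 / math.sqrt(2) if x == 0 else 1.0

def dct2d(f):
    """Forward 2-D DCT of an N x N block, transcribed from the formula."""
    return [[(2 / N) * C(u) * C(v) * sum(
                 f[j][k]
                 * math.cos((2 * j + 1) * u * math.pi / (2 * N))
                 * math.cos((2 * k + 1) * v * math.pi / (2 * N))
                 for j in range(N) for k in range(N))
             for v in range(N)]
            for u in range(N)]

def idct2d(F):
    """Inverse: sum the 64 basis functions, weighted by the coefficients."""
    return [[(2 / N) * sum(
                 C(u) * C(v) * F[u][v]
                 * math.cos((2 * j + 1) * u * math.pi / (2 * N))
                 * math.cos((2 * k + 1) * v * math.pi / (2 * N))
                 for u in range(N) for v in range(N))
             for k in range(N)]
            for j in range(N)]

# Round trip on an arbitrary block of 8-bit samples: at full floating-point
# precision the transform is reversible, as discussed above.
block = [[(17 * j + 31 * k) % 256 for k in range(N)] for j in range(N)]
restored = idct2d(dct2d(block))
assert all(abs(block[j][k] - restored[j][k]) < 1e-6
           for j in range(N) for k in range(N))
```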

As the DCT operating on 8 x 8 blocks of pixels produces coefficients relating to 8 discrete spatial frequencies each in X and Y, the coefficients in each cell of the resulting 8 x 8 array are best viewed as giving the relative weight of each of 64 basis functions, as shown in Figure 12-2. These images show the appearance of the combination of the separate X and Y waveforms corresponding to each point in the 8 x 8 array. The original image may therefore be recovered by summing these basis functions in accordance with this relative weighting.

Weighting, Quantization, Normalization, and Thresholding

As noted, the DCT itself is not a compression technique, and (as long as sufficient accuracy is maintained in the coefficients) results in neither a loss of data nor a reduction in the data rate. However, it places the image information into a form that is easier to compress without significant impact on the final image quality. The array of coefficients so generated represents the spatial frequency content of the original block of pixels, from the DC component as the top-leftmost coefficient to the highest frequency in both X and Y at the bottom right. In terms of importance to the overall image, the DC component can be assumed to be the most important value for this block, as the higher-frequency coefficients represent progressively finer detail. Thus, the first step following the transform in a DCT-based compression is to assign “weights” to the coefficients based on their relative importance; the weighting of each is generally specified in the particular compression standard in use. Following weighting, the coefficients are quantized and truncated (converted to a specific bit length each, which may not be constant across the full set), normalized per the requirements of the standard or system in use, and subjected to “thresholding.” The thresholding step sets low-value coefficients – those below a specified threshold, which again may vary across the set – to zero. These steps are illustrated in Figure 12-3.
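The following sketch illustrates these steps; note that the weight table and threshold used here are invented placeholders, not values taken from any actual standard:

```python
# Illustrative weighting / quantization / thresholding of an 8 x 8 array of
# DCT coefficients (e.g., the output of dct2d above). The weight table is a
# made-up placeholder that simply penalizes higher spatial frequencies; real
# systems specify their own tables, and may vary them on the fly.

N = 8
weights = [[1 + u + v for v in range(N)] for u in range(N)]  # (0,0) is DC
THRESHOLD = 1   # quantized values at or below this magnitude are dropped

def quantize_block(coeffs):
    out = [[0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            q = round(coeffs[u][v] / weights[u][v])      # coarser at high freq.
            out[u][v] = q if abs(q) > THRESHOLD else 0   # thresholding step
    return out
```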

Figure 12-2 The 8 x 8 basis functions of the discrete cosine transform (DCT) compression method.

Encoding

At this point, some compression of the original data may already have been achieved, as information corresponding to the higher spatial frequencies may have been eliminated or reduced in importance (via the quantization and thresholding processes). Further compression may now be achieved by noting that the resulting coefficients for these upper frequencies are those most likely to have a zero value. Thus, the information in this array of coefficients may be most efficiently transmitted by reading it out in a “zig-zag” ordering (per Figure 12-3c) and applying a combination of run-length and variable-length coding techniques to the resulting series of values. The “zig-zag” ordering tends to maximize the length of zero runs, making for the most effective compression in this manner.
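A compact way to produce that ordering is to sort the coefficient positions by diagonal, alternating direction on odd and even diagonals. The sketch below (our own formulation) does exactly this, and its output can feed the run-length coder sketched earlier:

```python
def zigzag(block, n=8):
    """Read an n x n array in zig-zag order (the ordering of Figure 12-3c),
    grouping the high-frequency, mostly zero coefficients at the end."""
    order = sorted(((u, v) for u in range(n) for v in range(n)),
                   key=lambda p: (p[0] + p[1],                     # diagonal
                                  p[0] if (p[0] + p[1]) % 2 else p[1]))
    return [block[u][v] for u, v in order]

# Example: only the DC term and one low-frequency term are non-zero.
block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1] = 52, -3
print(zigzag(block)[:5])    # [52, -3, 0, 0, 0], followed by 59 more zeros
```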

Figure 12-3 Outline of the complete DCT-based compression process. A block of pixels of a predefined size (in this case, 8 x 8) is transformed to an equivalent-sized array of DCT basis function coefficients. This step is lossless, assuming sufficient precision in the calculation of the coefficients. These are then weighted and quantized – a lossy step – and re-ordered in the “zig-zag” fashion shown. This maximizes the length of runs of zero values, as these are more likely in the higher-frequency coefficients. Finally, the sequence is run-length encoded for further compression.

By itself, a DCT-based compression method as described here can result in a compression ratio of up to about 20:1. (It is important to note that the parameters of DCT compression, such as the weighting table, quantization, etc., may be adjusted to trade off image quality for compression, and vice versa. This can even be done “on the fly” to meet varying requirements of the transmission channel, source material, and so forth.) But so far, we have looked only at compressing the data in a given single image – which might be a still picture, at this point. Further compression is required for HDTV applications, and is achieved by noting the redundancies that exist in motion video.

Compression of Motion Video

The DCT-based compression methods described up to this point can achieve significant savings for single, isolated images (and in fact are used in such applications, as in the JPEG, or Joint Photographic Experts Group, format). However, television is a medium for transmitting moving images as a series of successive stills, and we must also ask whether further compression can be achieved by taking advantage of this. In the most extreme case, the television transmission of a still picture, the answer is clearly yes – conventional TV broadcasting practice would have us send the same picture over and over again, an obvious case of extreme redundancy. It would be far better to simply send the picture once, and send nothing further until a change in the image had occurred.

But even “moving” pictures exhibit this same sort of redundancy to a large degree. In the vast majority of cases, there is considerable similarity between successive frames of a film or a video transmission. An object may move, but the appearance of the object itself changes little if at all from one frame to the next. As the simplest example of this, consider “motion” which occurs by panning a camera across a scene comprising only still objects. Some new information enters the frame at one side, while “old” information leaves at the other, but the majority of the image could easily be described as “same as before, but shifted by this amount in that direction.” Sending such a description, plus the small amount of “new” data which has entered the image, is clearly more efficient than sending a complete new frame.

In practice, these concepts may be extended and generalized through the use of motion estimation/prediction techniques, applied to the same blocks of pixels as were used in the DCT transform above, and through the transmission of difference or error information to be used in correcting the resulting “predicted” frames. This technique is shown in the series of illustrations given in Figure 12-4.

Figure 12-4 Motion prediction. Rather than compressing and transmitting each frame individually, digital television systems achieve further efficiency gains through the use of motion prediction. Here, motion vectors are calculated for each block of the original (as used in the DCT compression process described earlier) by comparing adjacent frames. These give the average motion for the pixels in that block. These vectors may be used to produce a predicted next frame, which is then compared to the actual next frame. The errors found in this comparison are themselves compressed and transmitted to the receiver along with the motion vectors, and together they permit the receiver to generate a fairly accurate version of the next frame.

The process begins by assigning a motion vector to each of the previously determined blocks of pixels within the image. This is done by comparing successive frames or fields of the series being transmitted, and attempting to determine the new (or previous) location of each block. That determination is made by minimizing the error that results from treating a given frame as nothing more than a translation of the blocks in the preceding or following frame. Note that it is assumed that all pixels in a given block experience the same displacement, and that the motion of each block consists only of translation in X or Y – the block size and shape do not change, nor are possible rotational motions considered. With a motion vector assigned to each block, it now becomes possible to create a predicted frame based solely on this information and the blocks from the previous frame. (Note: the MPEG compression system actually involves some bidirectional motion prediction, in which frames are predicted not only from the previous frame, but also from the next. Throughout this discussion, one should keep in mind that the techniques discussed are often applied in the “reverse” direction as well as the forward.)
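A minimal version of this search might look as follows; the exhaustive, fixed-window approach and the function names are ours, chosen for clarity rather than speed:

```python
# Exhaustive block matching: for the 8 x 8 block whose top-left corner is at
# (bu, bv) in the current frame, find the single translation (du, dv) within
# a small search window that best explains it in terms of the previous
# frame, by minimizing the sum of absolute differences (SAD).

def sad(prev, curr, bu, bv, du, dv, n=8):
    """Error of predicting the current block from prev displaced by (du, dv)."""
    return sum(abs(prev[bu + du + j][bv + dv + k] - curr[bu + j][bv + k])
               for j in range(n) for k in range(n))

def motion_vector(prev, curr, bu, bv, search=4, n=8):
    """Best translation within +/- search pixels; all pixels in the block
    are assumed to share one purely translational displacement."""
    h, w = len(prev), len(prev[0])
    candidates = [(du, dv)
                  for du in range(-search, search + 1)
                  for dv in range(-search, search + 1)
                  if 0 <= bu + du <= h - n and 0 <= bv + dv <= w - n]
    return min(candidates, key=lambda d: sad(prev, curr, bu, bv, d[0], d[1]))
```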

However, it is clear that such a predicted frame would show significant errors if compared to the actual next frame in the original series; the pixels do not all move uniformly within a given block, and there will be areas of the “new” frame containing information not present in the original frame. But these errors may be compensated for to a large degree. Since it may safely be assumed that both the source encoder/compression hardware and the receiver can produce identical “predicted” frames through the above process, the encoder will also compare this frame to the actual next frame in the series. This may be done by simply subtracting one frame from the other; the resulting non-zero values represent errors between the two. Such an “error frame” clearly requires far less transmitted data than a complete “original” frame: static portions of the image, or those moving areas where the translation of blocks accurately describes the result, will show zero error, and so the error frame can usually be expected to consist mostly of zeroes. Thus, transmitting only the motion vectors and the error information should permit the receiver to generate a predicted frame that is a very good approximation to the “real” image which would otherwise have been transmitted at that time.
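Continuing the sketch, the prediction and error-frame round trip can be expressed in a few lines; again, this is an illustrative formulation rather than any standard’s actual procedure:

```python
# Both encoder and receiver build the same predicted frame from the previous
# frame plus the motion vectors; only the vectors and the (mostly zero)
# error frame need to be transmitted.

def predict_frame(prev, vectors, n=8):
    """Assemble a predicted frame by copying displaced blocks from prev.
    `vectors` maps each block origin (bu, bv) to its motion vector (du, dv);
    all referenced blocks are assumed to lie within the frame."""
    h, w = len(prev), len(prev[0])
    pred = [[0] * w for _ in range(h)]
    for (bu, bv), (du, dv) in vectors.items():
        for j in range(n):
            for k in range(n):
                pred[bu + j][bv + k] = prev[bu + du + j][bv + dv + k]
    return pred

def error_frame(actual, pred):
    """Per-pixel difference; zero wherever the prediction was exact."""
    return [[a - p for a, p in zip(ra, rp)] for ra, rp in zip(actual, pred)]

def reconstruct(pred, err):
    """Receiver side: prediction plus error recovers the frame exactly."""
    return [[p + e for p, e in zip(rp, re)] for rp, re in zip(pred, err)]
```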

In the actual MPEG compression standards, three types of frames are recognized, as follows:

•    The I-frame, or “intra” frame. This is a fully self-contained, compressed version of one frame in the original series. Simply put, the I-frame is the basis for the generation of all predicted frames by the receiver, until the receipt of the next I-frame. I-frames are not the result of any motion prediction; they are, instead, the “starting point” for the next series of such predictions.

•    The P-frame, or “predicted” frame. P-frames are generated by the receiver, using the motion vector and error information supplied in the transmitted data, per the above description. In the transmission, multiple P-frames may be produced between I-frames, by sending additional motion vector and error information for each.

•    The B-frame, or “bidirectionally predicted” frame, also known as the “between” frame. B-frames are produced by applying both forward motion prediction, based on the latest I-frame, and backward prediction, based on the P-frame generated from that I-frame and the additional motion and error information. B-frames may be viewed as interpolations between I- and P-frames, or between pairs of P-frames.

These are shown in Figure 12-5.

Figure 12-5 The stream of I, P, and B frames in a digital television transmission.
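The dependency structure of Figure 12-5 can be summarized as simple bookkeeping. The particular pattern and group length below are purely illustrative, since, as discussed next, the mix of frame types is not fixed:

```python
# Illustrative dependency table for one group of frames; real encoders vary
# both the pattern and its length adaptively.
gop = [
    ("I0", []),            # self-contained; anchors everything that follows
    ("B1", ["I0", "P3"]),  # interpolated between the I-frame and next P-frame
    ("B2", ["I0", "P3"]),
    ("P3", ["I0"]),        # predicted forward from the I-frame
    ("B4", ["P3", "P6"]),  # interpolated between a pair of P-frames
    ("B5", ["P3", "P6"]),
    ("P6", ["P3"]),
]

for name, refs in gop:
    basis = ", ".join(refs) if refs else "nothing (intra-coded)"
    print(f"{name} is decoded from: {basis}")
```

One consequence visible in the table is that a B-frame cannot be decoded until the P-frame following it is available, which is why B-frames are best viewed as interpolations rather than as links in the prediction chain.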

The number of P- and B-frames between I-frames is not fixed; it may be varied depending on the needs of the data channel, the source material, and the desired image quality. Clearly, a static or nearly static image can be transmitted with relatively few I-frames without suffering a visible loss in image quality. Conversely, video containing a high degree of complicated motion may require fewer P- and B-frames between I-frames, and may demand higher compression (with the resulting loss in image quality) in the I-frames to compensate. Permitting the compression system to be adaptive in this manner allows it to continually adjust itself for the best overall image quality possible in a given situation.

This is not to say that the MPEG-2 compression method, as implemented in various HDTV, DBS, and other digital-television systems, always delivers a displayed image that is indistinguishable from the original. Trying to fit a very large amount of information into “too small a pipe” is never done without some impact on the quality of the transmission, and in practice compression artifacts and other errors can become quite visible. Momentary signal loss, periods when the system cannot adjust its behavior rapidly enough, or situations in which there simply is not sufficient channel capacity for the task at hand will all result in visible errors. Due to the nature of this compression method, the most commonly seen errors include visibility of the “block” structure, especially around areas of high detail and rapid motion (such as moving edges), and momentary corruption of the image (again in a visibly “blocky” manner) until enough new data has been received to “rebuild” it.
