Digital Video (Internet Video) (Video Search Engines)

Introduction

Today’s digital video systems can produce excellent quality visual and auditory experiences at relatively low cost. However, Internet users still encounter many problems that result in an unsatisfactory experience. Although the situation has been steadily improving, buffering delays, incompatible formats, blocky or blurry images, jerky motion, and poor synchronization between audio and video are not uncommon, and they lead to frustration to the point that the user experience of video services built around search is greatly impacted. Users’ expectations are raised by their familiarity with broadcast television systems, where well-established standards, mature technologies, and abundant bandwidth prevail. In this topic, we provide background information to shed light on the complexities involved in delivering IP video. We address the practical issues that video search engine systems must resolve in order to deliver their “product” – relevant video information – to users.

Aspect Ratio

When designing user interfaces for visualizing video search results, the frame aspect ratio (FAR) of the source video and the resulting thumbnails must be taken into account. For many years the ratio of width to height for the bulk of video on the Web was 4:3, but with HD cameras dropping in price, more and more 16:9 video is appearing. Content sourced from motion picture film may have one of several aspect ratios, but it has always been wider than standard definition television. It is also common to find wide aspect ratio source material digitized within a 4:3 frame in letterbox format, with black bars at the top and bottom. When laying out grids of thumbnails for visual browsing, these variations create basic layout problems and make the thumbnails for some content appear smaller than others, impeding browsing.
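As one illustration of the layout issue, the sketch below (Python; the function name and 160 x 120 cell size are illustrative assumptions, not taken from the text) scales a frame to fit a fixed grid cell while preserving its frame aspect ratio and computes the black-bar padding needed to centre it:

    def fit_thumbnail(src_w, src_h, cell_w=160, cell_h=120):
        """Scale a frame to fit a fixed grid cell, preserving its aspect ratio.
        Returns the scaled size plus the horizontal/vertical padding (the
        letterbox or pillarbox bars) needed to centre it in the cell."""
        scale = min(cell_w / src_w, cell_h / src_h)
        thumb_w, thumb_h = round(src_w * scale), round(src_h * scale)
        pad_x, pad_y = (cell_w - thumb_w) // 2, (cell_h - thumb_h) // 2
        return thumb_w, thumb_h, pad_x, pad_y

    # 16:9 HD content in a 4:3 cell gets bars at the top and bottom,
    # while 4:3 content fills the cell exactly:
    print(fit_thumbnail(1280, 720))   # (160, 90, 0, 15)
    print(fit_thumbnail(640, 480))    # (160, 120, 0, 0)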


Metadata extraction systems must accommodate video with disparate spatial resolutions. For example, a system may detect faces and represent the bounding box results in XML for content that is 640 x 480 or 320 x 240, but render a user interface with 160 x 120 thumbnails. We can scale the thumbnails or rely on the browser to do so, but we must also scale the bounding box coordinates if we are to plot the detection results overlaid on the thumbnails using Scalable Vector Graphics (SVG) or Vector Markup Language (VML). Any region-based image metadata must therefore be normalized for query and display to handle source images at various scales, and it must support independent vertical and horizontal scale factors to normalize different frame aspect ratios.
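A minimal sketch of this coordinate normalization, assuming boxes are stored as (x, y, width, height) in source-frame pixels (the function name and values are illustrative):

    def scale_box(box, src_size, dst_size):
        """Map a bounding box from source-frame coordinates to thumbnail
        coordinates; the horizontal and vertical factors may differ when the
        frame aspect ratio is also being normalized."""
        (x, y, w, h), (src_w, src_h), (dst_w, dst_h) = box, src_size, dst_size
        sx, sy = dst_w / src_w, dst_h / src_h
        return (round(x * sx), round(y * sy), round(w * sx), round(h * sy))

    # A face detected on a 640 x 480 frame, drawn over a 160 x 120 thumbnail:
    print(scale_box((320, 120, 80, 80), (640, 480), (160, 120)))  # (80, 30, 20, 20)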

Pixel aspect ratio (PAR) further complicates the matter. Early analog cameras and analog TV systems did indeed have continuous signals along the scan lines that varied in relation to the illumination – similar to the situation with audio microphones. However, in the vertical direction the picture was sampled, as is done in digital systems: there is a fixed, discrete number of “lines” per frame – for NTSC we can count on 480 valid lines of picture information. For digital television we must of course sample in the horizontal dimension as well, and then quantize the samples. Since the FAR for NTSC is 4:3, we should divide each line into 640 pixels so that each sample covers the same small extent of the picture in the vertical and horizontal directions – a square pixel. So why introduce a “rectangular pixel”? It turns out that the channel bandwidth of the NTSC specification justifies sampling the signal at a higher rate to preserve image detail. A horizontal resolution of 720 samples per line is commonly used, and ATSC DTV also specifies a standard definition sampling resolution of 704 x 480. So some content may be sampled with square pixels while other content has pixels that look like shoe boxes standing on end. A feature detector based on spatial relations (e.g. Viola / Jones) trained on square pixel data will perform poorly on rectangular pixel data, so a preprocessing image conversion step is required. It is of course possible to scale the detector or make it invariant to scale, but this is more complex. Failure to manage FAR and PAR correctly not only degrades metadata extraction performance, it also results in objectionable geometric distortion: circles look like ovals, and actors look like they have put on weight.
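The preprocessing step can be as simple as resampling the frame to square pixels before detection; a sketch using the Pillow library (file names are placeholders):

    from PIL import Image

    # A 4:3 picture sampled at 720 x 480 has non-square pixels; resample it to
    # 640 x 480 so that a detector trained on square-pixel data sees the
    # geometry it expects.
    frame = Image.open("frame_720x480.png")
    square = frame.resize((640, 480), Image.LANCZOS)
    square.save("frame_640x480.png")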

A similar issue can arise in the temporal dimension, since we may encounter video with a wide range of frame rates. Rates of 30, 29.97, 25, and 24 frames per second are common, and lower bit-rate applications may use 15 f/s. Security or Webcam video may forsake smooth motion altogether and use 1 f/s to save storage. Media players can render the video at the proper rate, but motion analysis algorithms that assume a given frame rate may not perform well for all content. In practice this is usually not much of a problem, since such algorithms are typically designed to accommodate a wide range of object velocities. Think of gait detection or vehicle counting: the absolute estimate of object velocity may be affected, but the detection rate may not be.
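Where the absolute velocity does matter, the per-frame estimates simply need to be normalized by the source frame rate, as in this small sketch (names are illustrative):

    def pixels_per_second(displacement_px_per_frame, frame_rate):
        """Convert a per-frame motion estimate into a rate that can be
        compared across 30, 25, 15, or 1 f/s sources."""
        return displacement_px_per_frame * frame_rate

    print(pixels_per_second(4, 30))  # 120 px/s at 30 f/s
    print(pixels_per_second(4, 15))  # 60 px/s: same per-frame shift, half the speed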

Interlacing is another source of problems for video systems. Interlacing was introduced years ago with the first television broadcast standards to effectively double the spatial resolution given a limited-bandwidth channel. The cost, however, is lower temporal resolution (and increased complexity for video processing engineers). The frame is divided into two fields, one with the odd-numbered lines and one with the even; the fields are transmitted sequentially. The result is fine for static pictures, but any objects in motion produce saw-tooth edges if the video is paused or sampled at frame resolution. If we are subsampling to create thumbnails, this may not be a problem. The new HDTV standards perpetuate interlacing (1080i vs. 720p). The term “progressive” is used to refer to noninterlaced video, though amusingly the term “progressive JPEG” refers to something similar to interlacing. Video processing algorithms must handle interlaced sources gracefully, by de-interlacing, dropping fields, or taking into account the slight vertical sampling offset between consecutive fields.
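The crudest of those strategies, keeping one field and line-doubling it, takes only a few lines of NumPy (a rough sketch, not a production-quality de-interlacer):

    import numpy as np

    def deinterlace_drop_field(frame):
        """Keep one field (the even-numbered lines) and repeat each line,
        trading vertical resolution for freedom from saw-tooth artifacts."""
        field = frame[0::2]                      # rows 0, 2, 4, ... = one field
        return np.repeat(field, 2, axis=0)[:frame.shape[0]]

    interlaced = np.zeros((480, 720), dtype=np.uint8)  # placeholder frame
    print(deinterlace_drop_field(interlaced).shape)    # (480, 720)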

The relation of illumination or intensity to signal amplitude mentioned above is nonlinear and is modeled as a power law whose exponent is referred to as ‘gamma’. Analog television systems were designed for CRTs with a nonlinear response and so precompensated the signal. Computer graphics applications and many image processing algorithms, however, assume a linear relation, so gamma-encoded video may need to be linearized before processing.
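A minimal linearization sketch, assuming a pure power law with gamma = 2.2 (real transfer functions such as sRGB or BT.709 add a small linear segment near black):

    import numpy as np

    def to_linear(encoded, gamma=2.2):
        """Convert gamma-encoded 8-bit values to linear-light values in [0, 1]."""
        return (encoded.astype(np.float64) / 255.0) ** gamma

    def to_encoded(linear, gamma=2.2):
        """Re-apply the gamma for display."""
        return np.round(255.0 * np.clip(linear, 0.0, 1.0) ** (1.0 / gamma)).astype(np.uint8)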

Luminance and Chrominance Resolution

The human visual system cannot resolve image features that differ in hue but have similar brightness as well as it can resolve features that vary in luminance. Therefore, compression and transmission systems encode chrominance information at lower spatial resolution than luminance with little apparent loss of image quality. The terms 4:2:2, 4:2:0, 4:1:1, etc. refer to the amount of subsampling of the chrominance relative to the luminance for different applications. When the image is rendered for display, it is converted from a luminance-chrominance color space such as YUV or YCbCr to RGB using a linear transform. Nonlinear transformations to spaces such as HSV yield a better match to the perceived visual qualities of color, but the simpler linear transformation is sufficient for coding gain. Single-chip CCD or CMOS sensors designed for low-cost consumer applications such as mobile phones or cameras also take these effects into account. Rather than having an equal number of R, G, B sub-pixels, a color filter array such as the Bayer checkerboard [Bayer76] is used to produce an image with relatively higher luminance resolution; this scheme has twice as many green pixels as red or blue. Another point to consider is that the spectral sensitivity of the human eye peaks in the green region of the spectrum, while silicon’s sensitivity is highest in the infrared (IR). IR blocking filters are used to select the visible portion, but the sensitivity in the blue is much lower than in the red, so the signal-to-noise ratio of the blue component is always lower than that of the green or red. Color correction processing as well as gamma correction tends to emphasize this noise. Also, color correction parameters are determined for given illumination conditions and, particularly in consumer applications, poor end-to-end color reproduction is common. Noise in the blue component, subsampled chrominance, and poor color reproduction not only degrade image quality, but also degrade the performance of video processing algorithms that attempt to take advantage of color information.
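The luminance-chrominance to RGB conversion mentioned above is a small linear transform per pixel. The sketch below uses the full-range BT.601 (JPEG-style) coefficients and nearest-neighbour upsampling of 4:2:0 chroma; it is a rough illustration, not a production converter:

    import numpy as np

    def ycbcr420_to_rgb(y, cb, cr):
        """y is H x W; cb and cr are (H/2) x (W/2) planes (4:2:0 subsampling).
        Returns an H x W x 3 uint8 RGB image."""
        # Upsample chroma to full resolution by pixel replication.
        cb = np.repeat(np.repeat(cb, 2, axis=0), 2, axis=1).astype(np.float64) - 128.0
        cr = np.repeat(np.repeat(cr, 2, axis=0), 2, axis=1).astype(np.float64) - 128.0
        y = y.astype(np.float64)
        r = y + 1.402 * cr
        g = y - 0.344136 * cb - 0.714136 * cr
        b = y + 1.772 * cb
        return np.clip(np.stack([r, g, b], axis=-1), 0, 255).astype(np.uint8)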

Video Compression

Web media is compressed; users almost never encounter original, uncompressed video or audio – the sheer scale of storage and bandwidth required makes this impractical. Even QVGA resolution (320 x 240) requires over 55 megabits per second to render in 24-bit RGB at 30 frames per second, and higher resolutions require proportionally more. The requirement that video be compressed has several implications for video search engine systems, as we shall see.
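The arithmetic behind that figure is worth keeping in mind when sizing storage and delivery systems:

    def raw_bitrate_mbps(width, height, fps=30, bits_per_pixel=24):
        """Uncompressed bitrate in megabits per second."""
        return width * height * bits_per_pixel * fps / 1e6

    print(raw_bitrate_mbps(320, 240))     # QVGA: ~55 Mb/s
    print(raw_bitrate_mbps(720, 480))     # SD:   ~249 Mb/s
    print(raw_bitrate_mbps(1920, 1080))   # HD:   ~1493 Mb/s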

Lossless video compression is rarely used since the bitrate reduction attainable is quite limited. Lossy compression offers impressive performance, but comes at the price of information loss – the original image or video sequence cannot be fully recovered from the compressed version. The distortion between the original and the reconstructed image is often measured using the peak signal-to-noise ratio (PSNR), although this is well known to be a poor match to perceived image quality; image quality is extremely difficult to quantify, being highly subjective and content dependent. PSNR is an example of a “full reference” quality metric as defined by ITU-T Recommendation J.144 – “partial reference” and “no reference” techniques are used for applications where full reference data is not available, for example measuring quality at the set-top box at the end of a video delivery service [J.144]. Compression algorithms are evaluated using rate-distortion plots, which reflect attempts to approach the information-theoretic limits described by Shannon’s rate-distortion theory. Algorithmic improvements have made great strides toward these limits, while Moore’s law has allowed increasingly complex implementations to be standardized and used in practical systems.
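PSNR itself is simple to compute from the mean squared error between the original and reconstructed frames; a sketch for 8-bit images:

    import numpy as np

    def psnr(original, reconstructed, peak=255.0):
        """Peak signal-to-noise ratio in dB between two same-sized 8-bit frames."""
        diff = original.astype(np.float64) - reconstructed.astype(np.float64)
        mse = np.mean(diff ** 2)
        if mse == 0:
            return float("inf")  # identical images
        return 10.0 * np.log10(peak ** 2 / mse)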

Since video is a series of still frames, one would expect video compression to be related to the JPEG image compression used in digital cameras, and this is indeed the case. Many consumer cameras capture video as a sequence of JPEG frames in the “Motion JPEG” (M-JPEG) format, since the computational complexity of this approach is minimal. At the high end, professional editing systems use M-JPEG or “MPEG-2 I-frame-only” as well; here the systems are designed for high quality and for ease of cutting and splicing sequences together, rather than for high compression ratios.

JPEG works by dividing an image into small blocks and transforming them (using the Discrete Cosine Transform) from the pixel domain to the spatial frequency domain. In this domain, pixels whose intensity values are similar to their neighbors can be efficiently represented – in smooth areas of an image, an entire block can be approximated by just its average (or DC) value or a few DCT coefficients. To get an intuition for the concept of spatial frequency, take a look at a folder of digital photo files and sort them by file size. The larger files will have a large proportion of the image in sharp focus with a lot of edge information, say from a brick wall or a tree with leaves. The smaller, more compressed files will be the out-of-focus shots, or will contain a small object on a large homogeneous background. Now suppose that we point a camera at a brick building and capture a video sequence in vivid detail. The frames are nearly identical – they have a high degree of temporal redundancy. By subtracting the second frame from the first, we end up with a frame that is mostly uniform, perhaps with a small region where someone sitting by a window in the building moved slightly. As we have found, this is the type of image that compresses well, so the entire sequence can be efficiently represented by encoding the first frame (intra-frame coding) followed by encoding the difference between this frame and subsequent frames (inter-frame coding). There are of course complications that arise from temporal noise in the signal, illumination changes due to passing clouds, and so on. But the main problem in this scenario is that slight camera motion will result in a large difference image in any region where the picture is not uniform (the sky, for example, will not cause much of a problem). Video coders compensate for this using block matching, where a block of one frame is compared to several neighboring blocks in a subsequent frame to find a good match. In the case of a camera shift, most blocks will have the same shift (or motion vector). So video compression from MPEG-1 up through MPEG-4 is based on the DCT of motion-compensated frame difference images.
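The block-matching idea can be sketched directly: for each block of the current frame, search a small window of the previous frame for the shift that minimizes the sum of absolute differences. This exhaustive search is far slower than the heuristics real encoders use, and the names below are illustrative:

    import numpy as np

    def best_motion_vector(prev, curr, bx, by, block=16, search=8):
        """Exhaustive-search motion vector for the block whose top-left corner
        is (bx, by) in the current frame, searched over +/- `search` pixels."""
        target = curr[by:by + block, bx:bx + block].astype(np.int32)
        best_sad, best_mv = None, (0, 0)
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                y0, x0 = by + dy, bx + dx
                if y0 < 0 or x0 < 0 or y0 + block > prev.shape[0] or x0 + block > prev.shape[1]:
                    continue                       # candidate block falls outside the frame
                cand = prev[y0:y0 + block, x0:x0 + block].astype(np.int32)
                sad = np.abs(target - cand).sum()  # sum of absolute differences
                if best_sad is None or sad < best_sad:
                    best_sad, best_mv = sad, (dx, dy)
        return best_mv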

Video compression standards are designed and optimized for particular applications; there is no one-size-fits-all codec. The ITU developed H.261 and H.263 for low-bitrate, low-latency teleconferencing applications. For these applications, the facts that the camera is usually stationary (perhaps mounted on a pan-tilt stage next to a monitor) and that conferencing scenes typically involve static backgrounds with little motion greatly help improve quality at low bitrates; it is reasonable here for coders to transmit intra-coded blocks rather than entire intra-coded frames. MPEG-1 was developed for CD-ROM applications with bitrates in the 1 Mb/s range. MPEG-2 is used in broadcast distribution and in DVDs, where higher quality and interlaced video support are requirements. MPEG-4 brings increased flexibility and efficiency, of course with increased complexity, and finally the ITU and MPEG bodies have achieved interoperability with MPEG-4 Part 10 / ITU-T H.264 (AVC). For contribution feeds or editing applications, M-JPEG or similar intra-coded video at very high bitrates is appropriate to ensure quality downstream.

MPEG-2 Systems [Info00] added a wide range of capabilities that were not available with MPEG-1. While “program streams” are used for file-based applications (MPEG uses the term DSM – Digital Storage Media) where errors are negligible, the notion of a transport stream was introduced to allow efficient delivery over noisy channels such as those found in typical broadcast systems, including cable and today’s IPTV over DSL. The transport stream specification also supports multiplexing several (even independent) media streams, which enables secondary audio programming or alternative representations of the video at different resolutions and bitrates [Haskell97]. Table 3.1 lists a few common video compression standards and the bitrates typically encountered. For the actual maximum and minimum bit rates supported, readers should consult the standards documents.

Table 3.1. Applications of video compression systems (bit rates are approximate, and assume standard definition).

Standard             Typical bitrates            Common applications
M-JPEG, JPEG2000     Wide range, up to 60 Mb/s   Low-cost consumer electronics; high-end video editing systems
DVCAM                25 Mb/s                     Consumer, semi-pro, news gathering
MPEG-1               1.5 Mb/s                    CD-ROM multimedia
MPEG-2               4-20 Mb/s                   Broadcast TV, DVD
MPEG-4 / H.264       300 kb/s - 12 Mb/s          Mobile video, podcasts, IPTV
H.261, H.263         64 kb/s - 1 Mb/s            Video teleconferencing, telephony

Within all of these standards there are “profiles,” which are particular parameter settings for various applications. The later standards have a wide range of flexibility here, which allows them to span a wide range of applications, while the earlier standards are more constrained. So it is possible for an MPEG-4 decoder to be unable to decode a particular MPEG-4 bit stream (e.g. if the decoder supports only a baseline profile). Profiles are intended for varying degrees of complexity (i.e. the required computational power of encoders and decoders) as well as latency or error resilience. For example, for DVD applications, variable bit rate (VBR) encoding allows bits required to represent high-action scenes to be effectively borrowed from more sedate shots; of course, the player has to read large chunks of data from the disk and store them in a local buffer in order to decode the video. For digital broadcast TV, on the other hand, rapid channel change is desirable, so buffering requirements are kept to a minimum. The quality difference between DTV and DVD leads many viewers to think that DVDs are HD, while in fact only Blu-ray and HD DVD discs support resolutions higher than standard definition. Some of this confusion arises because DVDs are often letterboxed, but primarily it is due to the lack of obvious coding artifacts such as blocking or contouring. Higher bitrates play a role, but even at the same bitrate, real-time encoding for low-latency applications results in lower quality. Additionally, the quality of the source is key – some digital television sources are of dubious quality, perhaps with multiple generations of encoding – and DVD mastering is done offline, allowing for two-pass encoding. DVD mastering is really an art, a bit like making a fine wine as opposed to producing grape juice. So encoding system designers have a challenging job balancing latency, complexity, error resilience, and bandwidth to achieve the quality of experience that the viewer ultimately enjoys.

What implications do these video compression systems have for video search engines?

•    Video content analysis and indexing algorithms must either support the formats natively or transcode to a format that is supported. Since many algorithms operate in the pixel domain rather than the compressed domain, this “support” may simply mean that the system can decode the video. However, video quality does affect indexing accuracy – noise or image coding artifacts such as blocking can be significant problems. In some cases, periodic quality fluctuations due to poor bit allocation between intra- and inter-coded frames can also produce more subtle artifacts.

•    From a systems perspective, high-bitrate video may not be practical to archive online at scale. Further, each format must be supported by the client media player and by the media servers as well. This problem of incompatible media players and formats is driving a move to Flash formats, which at least offer a degree of independence from the client operating system.

•    Finally, as we have seen, these codecs are highly optimized for particular applications, which typically do not include streaming or fine-grained random access.

MPEG frames are organized into “groups of pictures” (GoPs), each consisting of an intra-coded frame (I frame) and several predicted frames (P and B frames). Applications such as media players cannot jump into a video stream in the middle of a GoP and start playing – they must refer back to the I frame. So in effect the GoP length determines the precision of media replay requests. For many applications the GoP length is less than a second (15 frames is common), so this has only minor effects on the user experience, but for high coding efficiency applications, “Long GoP” coding is used, where there may be several seconds between I frames. H.264/AVC introduces many more complex options in this area, such as multiple reference frames for different macroblocks, which further complicate random access [Rich03].
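A player or search interface that offers “jump to this point” therefore ends up snapping requested offsets to the nearest preceding I frame, roughly as sketched below (a simplification that assumes a fixed GoP length):

    def snap_to_gop(seek_seconds, frame_rate=30.0, gop_length=15):
        """Return the time of the I frame at or before the requested offset."""
        frame = int(seek_seconds * frame_rate)
        i_frame = frame - (frame % gop_length)
        return i_frame / frame_rate

    print(snap_to_gop(12.4))                   # 12.0 s (15-frame GoP at 30 f/s)
    print(snap_to_gop(12.4, gop_length=300))   # 10.0 s ("Long GoP": 10 s granularity)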
