Introduction to Media Processing (Video Search Engines)

The only media descriptions available for the vast majority of media published on the Web today are global, high-level metadata. To differentiate themselves from systems that treat the described media payload as an opaque data file, individual systems must employ automated content processing. While specific processing methods have been optimized for particular media types, there are common principles that apply to some degree across all media types. The field of digital signal processing includes several areas of focus, including speech, audio, image, and video processing. If we stretch the notion of signal processing from digitizing an analog waveform to include streams of symbols, we can consider text streams corresponding to the media dialog to be signals as well [Rab99]. Common media processing operations include noise reduction, re-sampling, compression, segmentation, feature extraction, modeling, statistical methods, summarization, and building compact representations for indexing and browsing.
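To make one of these operations concrete, here is a minimal sketch of low-level feature extraction, written in Python with NumPy; the function name, bin count, and synthetic frame are illustrative assumptions, not drawn from the text. It computes the kind of coarse color histogram that commonly serves as a per-frame feature in content-based video systems.

```python
import numpy as np

def color_histogram(frame: np.ndarray, bins: int = 8) -> np.ndarray:
    """Coarse per-channel color histogram of an RGB frame.

    frame: H x W x 3 uint8 array; bins: number of bins per channel.
    Returns a normalized feature vector of length 3 * bins.
    """
    counts = [np.histogram(frame[:, :, c], bins=bins, range=(0, 256))[0]
              for c in range(3)]
    feature = np.concatenate(counts).astype(float)
    return feature / feature.sum()  # normalize so frame size cancels out

# Illustrative usage with a synthetic frame
frame = np.random.randint(0, 256, size=(240, 320, 3), dtype=np.uint8)
print(color_histogram(frame).shape)  # (24,)
```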

In the previous topics we discussed the practical issues of compression systems in use today, as well as container file formats for media streams. We also discussed media-related text streams and formats, including closed captions, subtitles, and transcripts. Here we present, at an introductory level, the common elements of media processing as it relates to content-based video search engine systems. In later topics we will explore in greater detail some of the most common methods applied to audio, video, and text streams, and we will present multimodal processing, in which these media streams are processed in a coordinated manner to achieve greater accuracy than is possible by processing the components individually.


As we look into each media type in more detail, we will focus on feature extraction, segmentation, and information extraction. For us, the desired goal of media processing is to take largely unknown content and extract some level of structure, and possibly semantics, from the content. In the case of data mining, we might hope to obtain actionable information based on this analysis. We will find that low-level feature extraction has been well studied, and robust, efficient methods exist for operating on multiple media types; moving to true semantics or meaning, however, succeeds only in restricted domains, where we have some domain-specific knowledge and perhaps have developed models based on similar labeled data. This difficulty in moving from low-level features to a useful understanding of the media content is often referred to as the semantic gap.
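As a toy illustration of what it takes to cross this gap in a restricted domain, the following sketch (the labels, data, and nearest-centroid model are all hypothetical, not from the text) maps low-level feature vectors to semantic labels only after being given labeled examples from that domain; the mapping does not emerge from the features alone.

```python
import numpy as np

def train_centroids(features, labels):
    """Per-class mean of low-level feature vectors (nearest-centroid model)."""
    classes = sorted(set(labels))
    return {c: np.mean([f for f, l in zip(features, labels) if l == c], axis=0)
            for c in classes}

def classify(feature, centroids):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda c: np.linalg.norm(feature - centroids[c]))

# Hypothetical labeled data: e.g., histogram features from two shot classes
rng = np.random.default_rng(0)
studio = rng.normal(0.2, 0.05, size=(10, 24))
field = rng.normal(0.6, 0.05, size=(10, 24))
feats = list(studio) + list(field)
labels = ["studio"] * 10 + ["field"] * 10
model = train_centroids(feats, labels)
print(classify(rng.normal(0.6, 0.05, size=24), model))  # likely "field"
```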


Fig. 5.1. Conceptual view of media processing.

Fig. 5.1 represents a hierarchical view of media processing for video retrieval applications, focusing in particular on the case of shot boundary detection. Each level is characterized by functional blocks, with representative input and output data types shown. The scope of media processing is broad indeed when one considers that similar “drill-down” views could be drawn for each of the other tasks, such as speaker identification, text-based topic segmentation, etc. Representative features such as color and shape descriptors are shown, and MPEG-4 serves as an illustrative media source format. The results may be represented in an XML format such as MPEG-7, as the figure suggests.
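Since Fig. 5.1 drills down on shot boundary detection, a minimal sketch of one classic approach follows, in Python with NumPy; the threshold value, bin count, and synthetic frames are illustrative assumptions rather than anything prescribed by the figure. Consecutive frames are compared by the L1 distance between their color histograms, and a sharp jump is declared a cut.

```python
import numpy as np

def frame_histogram(frame: np.ndarray, bins: int = 8) -> np.ndarray:
    """Normalized per-channel color histogram (low-level feature)."""
    counts = [np.histogram(frame[:, :, c], bins=bins, range=(0, 256))[0]
              for c in range(3)]
    h = np.concatenate(counts).astype(float)
    return h / h.sum()

def shot_boundaries(frames, threshold=0.3):
    """Flag a cut between frames i-1 and i when their histograms differ sharply.

    threshold is an illustrative L1-distance cutoff; in practice it is
    tuned or learned from labeled data for the target content domain.
    """
    cuts, prev = [], None
    for i, frame in enumerate(frames):
        h = frame_histogram(frame)
        if prev is not None and np.abs(h - prev).sum() > threshold:
            cuts.append(i)
        prev = h
    return cuts

# Illustrative usage: two flat-color "shots" with a cut at frame 5
frames = [np.full((120, 160, 3), 40, dtype=np.uint8)] * 5 + \
         [np.full((120, 160, 3), 200, dtype=np.uint8)] * 5
print(shot_boundaries(frames))  # [5]
```

In practice, simple global-histogram differencing of this kind handles hard cuts reasonably well but misses gradual transitions such as fades and dissolves, which is one reason production systems layer additional features and models on top of it.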
