Media Segmentation (Video Search Engines)

Segmentation divides a stream of media into semantically consistent units, and it is important for several reasons. One benefit is increased efficiency in representation or compression: in video compression, shot boundary detection ensures that difference frames are calculated within a shot rather than across shot boundaries, resulting in much smaller deltas. For presenting content to users for rapid browsing, it is often desirable to remove the temporal element, representing a long, static segment with a single icon while still including shorter segments alongside it, as in a light-table view. Although the temporal aspect is not preserved, the viewer can immediately grasp the basic semantic content of the video without having to parse redundant information. Segmentation also benefits information retrieval, since metrics such as TF/IDF are more accurate after text has been segmented by topic. Consider a news program with five stories, one of which mentions NASA and the space shuttle many times while the others are unrelated: the program's calculated relevance rank will be diluted if the frequencies of occurrence are averaged over the entire program.
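To make the retrieval point concrete, here is a minimal sketch of the dilution effect. The tokenized transcripts, the story contents, and the plain TF/IDF weighting are all hypothetical simplifications introduced for illustration, not part of the original text:

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """Plain TF/IDF weight of one term in one tokenized document."""
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    df = sum(1 for doc in corpus if term in doc)      # document frequency
    idf = math.log(len(corpus) / (1 + df))            # smoothed inverse doc freq
    return tf * idf

# Hypothetical transcripts: one news program with five stories,
# only one of which mentions "nasa", plus two unrelated programs.
nasa_story = ["nasa", "shuttle", "launch", "nasa", "orbit", "nasa"]
other_stories = [["election"] * 6, ["weather"] * 6,
                 ["sports"] * 6, ["markets"] * 6]
whole_program = nasa_story + [t for s in other_stories for t in s]
corpus = [whole_program, ["economy"] * 30, ["traffic"] * 30]

# Unsegmented: the three NASA mentions are averaged over 30 tokens.
print(tf_idf("nasa", whole_program, corpus))  # TF = 3/30
# Segmented by story: the same mentions dominate a 6-token unit.
# (IDF is held fixed across both calls to isolate the TF effect.)
print(tf_idf("nasa", nasa_story, corpus))     # TF = 3/6, five times higher
```

Scoring the story as its own unit yields a term frequency five times higher than scoring the whole program, which is exactly the averaging loss described above.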

One of the challenges for practitioners of media processing in the context of segmentation is determining the appropriate level of granularity, or, if multiple levels are to be maintained in the system, the appropriate number of tiers in the hierarchy. At the base of this pyramid are the media samples themselves, where compression algorithms attempt to remove redundancy. Moving higher, we can extract low-level features based on small windows of time or space, typically on the order of 10 milliseconds for audio, or corresponding to small homogeneous regions of an image. Of course, the definition of homogeneous is a bit problematic: do we mean the same amplitude, the same gradient, or, for textures, the same periodic pattern? Higher still, we enter a realm where the extracted symbols may convey meaning, e.g. phonemes or words from speech recognition systems. Continuing in the speech domain for a moment, we encounter phrase or sentence segmentation tasks, and later topic or story segmentation. For image and video processing, we have object or foreground/background segmentation within a frame, camera operation detection, and shot segmentation followed by scene segmentation, where the multiple shots taken in a single physical location are grouped together (a simple shot-boundary detector is sketched below). Farther up, for produced video programs, we have program segment (or commercial) detection and again story or topic segmentation, perhaps using cues from multiple media streams.

It is implicit that we can stop segmenting when we have reached the top: a single media asset or file. But what about episodic content? Does not each asset instance represent a segment of a longer narrative in which familiar characters reappear and evolve? And are not productions of a similar genre, or from the same source, somehow related? This last level of segmentation moves us out of the signal processing and statistical classification domain and into database organization; typically we have labeled data in the form of an EPG (electronic program guide) to guide us here.
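As an illustration of one tier of this hierarchy, the following is a minimal sketch of shot-boundary detection that compares color histograms of consecutive frames. The 16-bin histograms, the L1 distance, and the 0.4 threshold are illustrative choices under simplified assumptions, not a definitive implementation; on real footage the threshold would be tuned, and gradual transitions would need more care:

```python
import numpy as np

def color_histogram(frame, bins=16):
    """Per-channel color histogram, normalized so frames of any size compare."""
    hist = [np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
            for c in range(3)]
    h = np.concatenate(hist).astype(float)
    return h / h.sum()

def shot_boundaries(frames, threshold=0.4):
    """Flag a boundary wherever consecutive frame histograms differ sharply.

    `frames` is an iterable of HxWx3 uint8 arrays; `threshold` is an
    illustrative value that would be tuned on real footage.
    """
    boundaries = []
    prev = None
    for i, frame in enumerate(frames):
        h = color_histogram(frame)
        if prev is not None:
            # L1 distance between successive histograms, in [0, 2]
            if np.abs(h - prev).sum() > threshold:
                boundaries.append(i)
        prev = h
    return boundaries

# Toy input: a "shot" of dark frames followed by a "shot" of bright frames.
dark = [np.full((48, 64, 3), 20, dtype=np.uint8) for _ in range(5)]
bright = [np.full((48, 64, 3), 220, dtype=np.uint8) for _ in range(5)]
print(shot_boundaries(dark + bright))  # -> [5]
```

Histogram comparison is deliberately insensitive to motion within a shot, which is why it sits at the shot tier rather than the object-segmentation tier: it reacts to wholesale changes in frame appearance, not to movement of regions within the frame.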
