Clustering, Structure Generation (Media Processing) (Video Search Engines)

Assuming we have good segmentation, the notion arises of clustering, or forming relations between segments using distance metrics. Consider a video program in which two participants discuss a series of issues, and we have three cameras in the studio: one for close-ups of each speaker and one for a wide shot. We successfully detect the cuts between shots to segment the media into logically consistent chunks at a temporal level on the order of tens of seconds. However, it is clear that another level of structure can be derived by analyzing the segmented media to associate related segments. In this case, we may discover a pattern that the producer has used to move between cameras: A, B, C, A, B, A, C, etc., where A and B represent the close-up views of each speaker and C is the wide shot.

As a second example, consider the case where an editor is trying to produce a rough cut from rushes, or repeated ‘takes’, of a particular scene. We may detect the start and stop of the camera, but we can also discover that there were five attempts to capture the first scene, then eight attempts at a subsequent scene, and so on. By analyzing the relative “distance” (or similarity) between successive shots we can derive this structure. In fact, commercially available editing systems can perform this function using audio cross-correlation, assuming there is repeated dialog in each shot. This can be of great value for navigating and organizing the mass of raw footage during the editing process. This level of organization (aggregating repeated takes of a particular scene, with or without dialog) has been the subject of a research evaluation undertaken by the National Institute of Standards and Technology in a rushes summarization task [Over07]. Summarization via automated content analysis allows users to more easily browse long-form content by removing redundancy within an asset, while clustering across search result sets facilitates browsing of large media archives.
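To make the clustering idea concrete, here is a minimal sketch that recovers a camera pattern like A, B, C, A, B, A, C by greedily grouping shots whose keyframe features fall within a distance threshold. The feature choice (normalized color histograms), the L1 distance, the threshold value, and all function names are illustrative assumptions rather than a prescribed method.

```python
# Sketch: recovering a camera/shot pattern by clustering shot keyframes.
# Assumes each shot is represented by a normalized color histogram of its
# keyframe (any fixed-length feature vector would do); names and the
# threshold are illustrative, not a reference implementation.
import numpy as np

def histogram_distance(h1, h2):
    """L1 distance between two normalized histograms (0 = identical)."""
    return np.abs(h1 - h2).sum()

def label_shots(histograms, threshold=0.4):
    """Greedily assign each shot to the nearest existing cluster, or
    start a new cluster if nothing is within `threshold`.
    Returns one label per shot, e.g. [0, 1, 2, 0, 1, 0, 2] ~ A,B,C,A,B,A,C."""
    centroids, labels = [], []
    for h in histograms:
        dists = [histogram_distance(h, c) for c in centroids]
        if dists and min(dists) < threshold:
            labels.append(int(np.argmin(dists)))
        else:
            centroids.append(h)            # new camera setup discovered
            labels.append(len(centroids) - 1)
    return labels
```

Run over the talk-show example above, such a procedure would yield a label sequence like [0, 1, 2, 0, 1, 0, 2], from which the producer's camera pattern can be read off directly.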

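The take-grouping function can be sketched in a similar spirit. The snippet below scores two takes by the peak of a normalized cross-correlation of their audio and groups consecutive shots whose audio matches the first take of the current group. Operating on raw PCM samples, the particular normalization, and the threshold are simplifying assumptions; production systems typically correlate compact audio features and use FFT-based correlation for speed.

```python
# Sketch: grouping repeated 'takes' by audio cross-correlation, assuming
# repeated dialog in each take. Clips are mono PCM arrays at the same
# sample rate; all names and the threshold are illustrative.
import numpy as np

def take_similarity(a, b):
    """Peak of a normalized cross-correlation; approaches 1.0 when one
    take contains roughly the same dialog as the other at some offset."""
    a = (a - a.mean()) / (a.std() * len(a))
    b = (b - b.mean()) / b.std()
    return float(np.max(np.correlate(a, b, mode="full")))

def group_takes(audio_clips, threshold=0.5):
    """Group consecutive shots into scenes: start a new group whenever a
    shot no longer correlates with the first take of the current group."""
    groups = [[0]]
    for i in range(1, len(audio_clips)):
        first = groups[-1][0]
        if take_similarity(audio_clips[first], audio_clips[i]) >= threshold:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups
```

Applied to the rushes example, this would yield groups such as five clips for the first scene followed by eight for the next, giving the editor a scene-level organization of the raw footage.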

Other important considerations should be borne in mind when evaluating media processing algorithms. Is the method rule-based, data-driven, or some combination thereof? Does the method involve the use of tunable parameters? How generalizable is the method? What are the storage vs. computational performance tradeoffs? For natural language applications, rule-based systems are generally quite useful when training data is not available, but they may become unmanageable as complexity increases. Data-driven methods may offer the promise of managing this complexity in a scalable manner, but they inevitably suffer from the problems that arise from the mismatch between training data and the data encountered in the field. We can expect performance to degrade over time as this gap between training and testing datasets widens, so steps must be taken to adapt existing models over time or to new domains. Active learning may be effective for minimizing the labeling effort while maximizing performance improvement.

For data-driven methods, the definition of the labels (typically set out in an annotation guide) and their consistent application by the labelers become important factors in system performance. It is generally observed that more labeled data is better, but high-quality labeling of large datasets is costly. Again, the more-is-better camp will argue that we can ask many labelers to label the same data and use techniques to derive a consensus labeling. Recently it has been observed that the human power of the Web can be exploited, perhaps via game-play scenarios, to build up large labeled data collections [Ahn06]. For these untrained-labeler situations, special effort must be made to avoid tag synonyms, redundant or inconsistent labels, etc.

As we start to tap into the social capabilities available via the Web, we must also consider the practical limits of automated content processing. At some point, if our goals of media understanding are impractical and the value of the content is high enough, we run up against the alternative: manual content description. For example, if a speech recognition system performs acceptably only for broadcast news content, then the system is of little value, since most of this material is closed-captioned or transcribed already. Many DVD subtitles are translated manually (and voluntarily) and posted to Websites, rendering the use of machine translation systems in this domain moot.
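Two of the data-driven practices above lend themselves to brief sketches. First, consensus labeling: the minimal version below derives a consensus by simple majority vote over the labels that multiple labelers assigned to one item, with ties yielding no consensus. Weighting votes by estimated labeler reliability, which real systems often do, is deliberately omitted, and the function name is illustrative.

```python
# Sketch: consensus labeling by majority vote over multiple (possibly
# noisy) labelers. Real systems often weight votes by estimated labeler
# reliability; this minimal version treats all labelers equally.
from collections import Counter

def consensus(labels):
    """labels: the labels assigned to one item, one per labeler."""
    if not labels:
        return "unknown"
    (top, top_n), *rest = Counter(labels).most_common()
    if rest and rest[0][1] == top_n:
        return "unknown"                  # tie: no consensus
    return top
```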

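Second, active learning: one common form is uncertainty sampling, in which the items the current model is least confident about are routed to labelers first, so that each label purchased yields the most improvement. The sketch below assumes a classifier exposing the scikit-learn predict_proba convention; the batch size and the function name are illustrative assumptions.

```python
# Sketch: uncertainty sampling, one simple form of active learning.
# `model` is assumed to follow the scikit-learn predict_proba convention.
import numpy as np

def select_for_labeling(model, unlabeled_X, batch_size=10):
    """Return indices of the `batch_size` least-confident items."""
    probs = model.predict_proba(unlabeled_X)      # shape (n, n_classes)
    confidence = probs.max(axis=1)                # top-class probability
    return np.argsort(confidence)[:batch_size]    # least confident first
```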