Introduction to Video Search Engine Systems

All search engine systems share a common architecture at a high level, but vary widely depending on the application and design choices. Viewed from a content flow perspective, there are three main architectural components: content acquisition, processing (indexing), and retrieval (see Fig. 4.1). In practice, these typically run as decoupled, independent processes to ease scaling. We will also consider the system from a user activity perspective, which brings user behaviors and system states into view.


Fig. 4.1. High level architecture of a typical video search engine.
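As a rough illustration of how the three stages can be decoupled, the following minimal sketch connects them only through a shared queue and an index. The function names, the record layout, and the example URLs are assumptions made for illustration; a production system would run each stage as a separate, distributed process.

```python
import queue


def acquire(sources, acquired):
    """Acquisition: place a record for each source on the hand-off queue."""
    for url in sources:
        # A real system would fetch or copy the media here.
        acquired.put({"url": url})


def process(acquired, index):
    """Processing: drain the queue and store simple metadata per item."""
    while not acquired.empty():
        item = acquired.get()
        # Hypothetical metadata: the file name stands in for a title.
        index[item["url"]] = {"title": item["url"].rsplit("/", 1)[-1]}


def retrieve(index, term):
    """Retrieval: return the URLs whose stored title contains the term."""
    return sorted(url for url, meta in index.items() if term in meta["title"])


acquired = queue.Queue()
index = {}
acquire(["http://example.com/news.mp4", "http://example.com/talk.mp4"], acquired)
process(acquired, index)
matches = retrieve(index, "news")
```

Because each stage touches only the queue or the index, any one of them could be replaced or scaled out without changing the others, which is the point of the decoupling.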

Acquisition refers to bringing source video content into the system and positioning it for subsequent indexing. This may involve copying the bulk media to local storage as in a traditional text search engine, or other modalities such as user contribution or even capture of live feeds. Acquisition is usually constrained or configured in some way; for example, a list of content sources or RSS feeds may be used. Content providers may use the Outline Processor Markup Language (OPML) to create such lists and publish them to search engines. Even in the case of a general Web crawl, the prior state of the crawl is used to direct future content location attempts, so the process is not free of constraints. Efficient crawling is well studied [Cha03] and, like other aspects of scalable search, is typically implemented in a distributed fashion.
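To make the OPML case concrete, here is a minimal sketch of how an acquisition process might read a published feed list. It assumes the common OPML convention in which each feed is an `outline` element carrying an `xmlUrl` attribute; the sample document and its URLs are placeholders, not real sources.

```python
import xml.etree.ElementTree as ET

# Hypothetical OPML feed list; the URLs are placeholders.
SAMPLE_OPML = """<opml version="2.0">
  <head><title>Video sources</title></head>
  <body>
    <outline text="News">
      <outline text="Evening news" type="rss" xmlUrl="http://example.com/news.rss"/>
    </outline>
    <outline text="Lectures" type="rss" xmlUrl="http://example.com/lectures.rss"/>
    <outline text="Folder without a feed"/>
  </body>
</opml>"""


def feed_urls(opml_text):
    """Return the xmlUrl of every outline element, in document order."""
    root = ET.fromstring(opml_text)
    return [
        node.attrib["xmlUrl"]
        for node in root.iter("outline")
        if "xmlUrl" in node.attrib
    ]


urls = feed_urls(SAMPLE_OPML)
```

Outlines may be nested to group feeds, so the sketch walks the whole tree rather than only the top level, and it skips grouping outlines that carry no feed URL.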


Content or media processing is the next logical stage in the content flow and involves transcoding, metadata manipulation, extraction and augmentation through media analysis methods. The goal is to capture the media structure and metadata in data structures that enable rapid retrieval and content adaptation.
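One possible shape for such data structures is sketched below: a per-video record that holds time-coded segments, plus an inverted index from terms to segment positions. The class names, fields, and whitespace tokenization are illustrative assumptions, and the segment text stands in for whatever a media analysis step (for example, captions or speech recognition) would produce.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Segment:
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    text: str     # text associated with the segment, e.g. captions


@dataclass
class VideoRecord:
    video_id: str
    title: str
    segments: list = field(default_factory=list)


def build_index(records):
    """Map each lowercase term to sorted (video_id, segment start) postings."""
    index = defaultdict(set)
    for record in records:
        for seg in record.segments:
            for term in seg.text.lower().split():
                index[term].add((record.video_id, seg.start))
    return {term: sorted(postings) for term, postings in index.items()}


records = [
    VideoRecord("v1", "Evening news", [
        Segment(0.0, 30.0, "Top stories tonight"),
        Segment(30.0, 60.0, "Weather forecast"),
    ]),
    VideoRecord("v2", "Morning show", [Segment(0.0, 20.0, "Top headlines")]),
]
index = build_index(records)
```

Keeping the segment start time in each posting lets a later retrieval step jump to the relevant point in a video rather than only identifying the video as a whole.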

The third major functional block from a content flow perspective is retrieval, where a query engine responds to user requests in a real-time interactive mode. The results are exposed through one or more user interfaces, and multimedia summaries or contextual information may be generated to improve the user experience. In addition to real-time query handling, modules for personalization or data mining can operate on the stored multimedia collection in an offline fashion to produce customized views or analytical results for users.
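The real-time query path can be illustrated with a small sketch that ranks videos against an in-memory postings table. The table contents, the scoring rule (a count of matching postings), and the function name are assumptions chosen for brevity; real query engines use more elaborate ranking.

```python
from collections import Counter

# Hypothetical postings: term -> list of (video_id, segment start in seconds).
INDEX = {
    "election": [("v1", 0.0), ("v1", 95.0), ("v2", 12.0)],
    "results": [("v1", 95.0), ("v3", 40.0)],
}


def search(index, query, limit=10):
    """Rank videos by the number of postings that match any query term.

    Each result is (video_id, score, earliest matching segment start).
    Ties are broken by video_id so the ordering is deterministic.
    """
    scores = Counter()
    first_hit = {}
    for term in query.lower().split():
        for video_id, start in index.get(term, []):
            scores[video_id] += 1
            first_hit[video_id] = min(start, first_hit.get(video_id, start))
    ranked = sorted(scores, key=lambda v: (-scores[v], v))
    return [(v, scores[v], first_hit[v]) for v in ranked[:limit]]


hits = search(INDEX, "election results")
```

Returning the earliest matching offset alongside each hit gives a user interface what it needs to build a summary or to start playback at the relevant moment.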
