Challenges of Video Search (Video Search Engines)

Searching requires browsing sets of candidate results. Video is a continuous (or linear) medium: when it is paused, only a single frame remains and the audio is lost. Text, by contrast, is displayed in a more parallel fashion and can therefore be browsed easily. Video storage and transmission requirements are several orders of magnitude greater than those for text. Textual features (characters, words) are well defined, can be efficiently encoded, and are limited in number. Video features (edges, colors, motion) and acoustic features (pitch, energy) are less well defined, computationally expensive to extract, and bulky to represent. In fact, there is little consensus on which features are best for a given application.

Furthermore, users can formulate textual queries easily using a keyboard so that, to a first approximation, the information retrieval problem reduces to a symbol look-up (i.e., find the documents containing this word). For video databases, the query-response cycle is cross-modal (enter text, retrieve video). Query by image content involves building a query by specifying image or video attributes, perhaps with a graphical tool, which is beyond the patience limits of the typical user [Flick95]. Query-by-example and relevance feedback methods are easier to use but require some seed search to bootstrap the process.

Comparing some of the issues faced by video search engine systems to their analogs from the text domain sheds light on the nature and scope of the challenges encountered.


The term invisible Web or hidden Web refers to Web resources that are not easily indexed by Web search engines. Search engines use crawlers (also called spiders) to locate content for indexing by following links that they encounter in each page that they parse. However, instead of maintaining large collections of HTML files, many sites generate HTML pages dynamically from content stored in XML files or in relational databases. The content may be exposed only if users search using a Web form, an action which crawlers cannot easily mimic. Another problem for crawlers arises from sites that require user registration and authentication in order to access content. Estimating the size of the invisible Web is obviously difficult since by definition the content cannot be seen, but it may be orders of magnitude larger than the surface (visible) Web. There are also socioeconomic aspects to this issue since surface content is dominated by commercial enterprises and is funded largely by advertising, while hidden Web content is often premium, academic, etc. Some would go so far as to dismiss the invisible Web content entirely by saying that since users only use search engines to locate content then it does not matter if content exists out of the reach of their favorite search engine.
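The link-following behavior described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not a description of how any particular search engine works; the class name `LinkExtractor`, the helper `extract_links`, and the example URLs are assumptions made for the sketch. It collects the links a crawler could follow and merely counts the forms, since content behind a search form stays out of reach.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href> tags and counts <form> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.forms = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "form":
            # Content reachable only through this form is not discovered.
            self.forms += 1


def extract_links(base_url, html):
    """Return (followable links, number of forms) for one page (hypothetical helper)."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links, parser.forms


page = """
<a href="/news/story1.html">Story</a>
<a href="http://example.org/about">About</a>
<form action="/search"><input name="q"></form>
"""
links, forms = extract_links("http://example.com/index.html", page)
print(links, forms)
```

A real crawler would add a queue of pending URLs, politeness delays, and robots.txt handling; the point here is only that links are trivially discoverable while form-gated content is not.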

Although the scale is not easily quantifiable, as far as users’ expectations are concerned, the invisible Web phenomenon is more severe for video than for text. There are cases where Web pages contain links directly to static video files, but this is the exception rather than the norm. Video content is typically accessed through a player, with complex scripting used to specify the video asset. Due to the size of the media objects and the complexities of maintaining news content, sites typically use asset management or publishing tools that are linked to databases. Professionally produced video entails high production costs, and sites recover the investment through advertising or subscriptions. Video advertising via forced playlists also foils search engine crawlers. Video protected by digital rights management (DRM) precludes content-based analysis. Attempts by search engines to circumvent any of these revenue-preserving schemes will not be received favorably by the content owners. Consumer-produced content posted on sharing sites, on the other hand, is often open to all viewers for free, and sites may have mechanisms to generate permanent links to videos. Crawlers may encounter these links on other sites, but the links point back to a full page rather than directly to the video file. Stream saver or downloader tools have been developed to work around these issues.

Stale links arise from content being moved or deleted after a crawler has indexed the content. While this is a problem in both the text and video domains, it may be more likely for video files because large file sizes or rights issues may lead sites to remove content after a certain period of time.

Media File Formats

Considering parsing, one can build a very useful text search engine by dealing with only a single content source file format: HTML. As an afterthought, one could add support for Adobe PDF, Microsoft Word and perhaps one or two more, but these formats represent such a tiny fraction of the total available Web documents that users may not even notice their omission. The HTML format is designed to be easily parsable, and although authors may create malformed HTML, there are many error-tolerant parsers to choose from. Video, on the other hand, comes in a wide variety of formats, and it is not clear which format is the most popular at any given time. New formats emerge, rise in popularity, and then may be knocked from the top spot as still newer formats gain popularity. Keeping up with these developments is a challenge for video search engines. Video container file format parsers and decoders are complex and often brittle, so that relatively minor deviations from the video encoding standard may cause parsing failure. Decoders may be able to deal with only a subset of the permissible video encoding parameter space, or may handle only certain “profiles” (e.g. MPEG-4 simple profile) and not others. Solutions have been built to address these issues, but they are complex to configure and administer, and can be costly to operate at scale.
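To give a sense of the format variety, the sketch below guesses a container format from the first bytes of a file. It is only an illustration: the function name `sniff_container` and the signature table are assumptions made for the example, the list covers a handful of well-known signatures rather than every format, and recognizing a container says nothing about whether a decoder can handle the codec or profile inside it.

```python
# (offset, magic bytes, label) for a few well-known container signatures.
SIGNATURES = [
    (0, b"\x1a\x45\xdf\xa3", "Matroska/WebM"),
    (0, b"FLV", "Flash Video"),
    (0, b"OggS", "Ogg"),
    (4, b"ftyp", "ISO base media (e.g. MP4)"),
]


def sniff_container(header):
    """Guess the container format from the leading bytes of a file (hypothetical helper)."""
    # AVI is a RIFF file whose form type, at byte offset 8, is "AVI ".
    if header[:4] == b"RIFF" and header[8:12] == b"AVI ":
        return "AVI"
    for offset, magic, label in SIGNATURES:
        if header[offset:offset + len(magic)] == magic:
            return label
    return None  # unknown or unsupported container


print(sniff_container(b"\x00\x00\x00\x18ftypisom"))
print(sniff_container(b"<html><body>"))
```

A `None` result is the common case an indexing pipeline has to plan for: the file is skipped, or handed to a more general (and more expensive) media toolkit.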

Data Transport

The data transport protocols for media are more diverse than for Web text. Again, crawlers need only implement HTTP in order to cover most Internet content, with FTP being a distant second, and HTTP stacks are available in many programming languages. HTTP streaming for video is gaining popularity, perhaps due to firewall issues, but video servers frequently use RTSP running over UDP to maximize throughput. UDP is a good choice for real-time video viewing, but it does not guarantee delivery, and lost data packets will cause problems for automated indexing systems. Search engines for broadcast monitoring applications may need to grapple with ATSC or DVB access issues.
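An indexing system that ingests a UDP stream at least needs to know how much data it missed. The sketch below counts gaps in a stream of packet sequence numbers. It assumes the media packets carry a 16-bit wrapping sequence number, as RTP packets do, and that packets arrive in order without duplicates; the function name `count_lost_packets` is hypothetical, and a real receiver would also have to handle reordering.

```python
def count_lost_packets(sequence_numbers, modulus=1 << 16):
    """Count missing packets in an in-order stream of wrapping sequence numbers."""
    lost = 0
    previous = None
    for seq in sequence_numbers:
        if previous is not None:
            # Distance from the previous packet, allowing for wrap-around.
            step = (seq - previous) % modulus
            if step > 1:
                lost += step - 1
        previous = seq
    return lost


# The counter wraps from 65535 to 0; packet 65535 never arrived.
print(count_lost_packets([65533, 65534, 0, 1]))
```

An indexer could use such a count to decide whether a captured segment is complete enough to analyze or should be fetched again.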


When generating search results, search engines represent documents by metadata such as title and URL, but they also include a brief summary or extract to enable users to quickly determine if the document is relevant to their query. In the text domain, extracting representative text segments is straightforward. Regular expressions can be used to efficiently identify text segments matching the user’s query terms, highlight them with markup, and locate blanks between words at which to break up long sentences. More sophisticated processing can remove redundancy to form more meaningful extracts. In the video domain, extraction or summarization methods are not well defined and require complex video processing.
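The text-domain case is simple enough to sketch directly. The example below is one possible approach, not the method of any particular search engine: it finds the first query-term match with a regular expression, cuts a window around it at blanks so that words are not split, and wraps matches in `<b>` markup. The function name `make_snippet` and the `width` parameter are assumptions made for the illustration.

```python
import re


def make_snippet(text, query_terms, width=40):
    """Return a short extract around the first query-term match, with matches marked up."""
    if not query_terms:
        return None
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in query_terms) + r")\b",
        re.IGNORECASE,
    )
    match = pattern.search(text)
    if match is None:
        return None
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    # Move each cut point to a blank so that no word is split.
    if start > 0:
        blank = text.find(" ", start, match.start())
        if blank != -1:
            start = blank + 1
    if end < len(text):
        blank = text.rfind(" ", match.end(), end)
        if blank != -1:
            end = blank
    # Highlight every query-term occurrence inside the extract.
    return pattern.sub(r"<b>\1</b>", text[start:end])


sentence = ("Video servers frequently use RTSP running over UDP "
            "to maximize throughput for real-time viewing.")
print(make_snippet(sentence, ["udp"], width=20))
```

Nothing comparable exists for video: choosing a representative frame or clip requires decoding and analyzing the content rather than matching symbols.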

The time required to preview video limits the total number of search results that a user is willing to view. Evaluating the relevance of a particular document is more time-consuming with video than with text. Neglecting HTTP site response time, text documents load within a second or two and users may be able to judge instantly if the page is worth reading; if so, quickly spot-checking several points in the document is usually enough to determine if it satisfies the query. For video, a much larger amount of data must be downloaded and buffered prior to start-up. After the video starts, the relevant content that the user is looking for is typically not in the first few seconds of playback. Video is normally consumed in a lean-back mode, and so content creators devote more time to lead-in material to pique the viewer’s interest. If a viewer attempts to seek past this content, then re-buffering must take place, and it is unlikely that the desired location will be reached on the first attempt. The long lead time required to evaluate document relevance frustrates users of video search.


Duplicate or near-duplicate pages in Web search results can frustrate users as they repeatedly see pages that they have already rejected as being irrelevant to their query intent. In the text domain, duplication is trivial to detect and there are well-accepted methods for determining document similarity (e.g. based on edit distance) that are reasonably efficient to compute in order to detect near duplicates. Duplicate videos in query result lists present even more of a problem for video search engine users. Videos take a significant amount of time to start playing, and the delay will be intolerable for users if they encounter duplicates in query result sets. Sometimes cues from metadata and thumbnails will be enough for users to spot duplicates, but not always. Duplicates are common in video search applications, since a single source of video, say a television broadcast, may be captured by several viewers and posted to numerous sites. Also, the same video may be broadcast repeatedly or at different times for different television markets, so even if the recording time and broadcast channel of a captured video clip are available and accurate, that may not be enough to determine if the content is duplicated. Twenty-four-hour news channels often rebroadcast footage of breaking news and may intersperse this with new video as it becomes available. Video duplicate detection is an algorithmic challenge and proposed algorithms are computationally intensive. Often a duplicate clip is posted to sharing sites with differing metadata.
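For the text case, the edit-distance approach mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: documents are compared word by word, and the `threshold` of 0.2 as well as the function names are choices made for the example rather than accepted values. The dynamic program is quadratic in document length, so large-scale systems generally rely on cheaper approximations.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, keeping two rows of the DP table."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def is_near_duplicate(doc_a, doc_b, threshold=0.2):
    """True if the word-level edit distance is a small fraction of the longer document."""
    words_a, words_b = doc_a.lower().split(), doc_b.lower().split()
    longest = max(len(words_a), len(words_b))
    if longest == 0:
        return True  # two empty documents are trivially identical
    return edit_distance(words_a, words_b) / longest <= threshold


print(is_near_duplicate(
    "Breaking news footage from the city center this morning",
    "Breaking news footage from the city centre this morning",
))
```

No equally cheap measure exists for two video files, which may differ in resolution, encoding, length, and overlaid graphics while showing the same footage.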

Ranking and Indexing

Text information retrieval algorithms, including ranking and document indexing, are mature, and off-the-shelf solutions that perform efficiently at scale are available. Video indexing is an emerging technology: widely accepted techniques are not yet available, and those that exist may not operate at the scale necessary for practical Web video search. Often the algorithms are domain-specific and cannot be applied to arbitrary, unknown video content.
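To show how compact the core of text indexing and ranking is, the sketch below builds an inverted index and ranks documents with a simple TF-IDF score. It is a toy illustration, not a production design: tokenization is plain whitespace splitting, the scoring formula is one common TF-IDF variant chosen for the example, and the function names and sample documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term to a posting dict of {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(text.lower().split()).items():
            index[term][doc_id] = count
    return index


def search(index, num_docs, query):
    """Rank documents by the sum of tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in the collection
        # Rare terms get a higher weight than common ones.
        idf = math.log(num_docs / len(postings)) + 1.0
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores, key=scores.get, reverse=True)


docs = {
    "d1": "video search engines index video",
    "d2": "text search is mature",
    "d3": "crawlers follow links",
}
index = build_index(docs)
print(search(index, len(docs), "video search"))
```

The whole scheme rests on text being made of discrete, well-defined symbols. Video offers no agreed-upon equivalent of a "term", which is part of why comparable general-purpose solutions are lacking.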
