User Perspectives (Video Search Engine Systems)

Interaction States

Let’s consider how a user interacts with a video search engine. We can model this interaction as a set of behavioral states, as shown in Fig. 4.6, where the retrieval activities are shown in white and the contribution activities in grey. The states are as follows:

•    Q: Query - the user is presented with a query interface (e.g. a text box in an HTML form) and must formulate or express the query to the system.

•    B: Browse - the service presents a set of rank-ordered results in response to the query. Metadata and thumbnail images are displayed as a list and the user can interact via scrolling, paging, etc.

•    V: View - the user has selected a particular item and the system initiates video playback using a media player.

•    A: Annotation - the user may tag, rate, review or otherwise comment on the video.

•    E: Edit - the user composes a video using a video editing tool, editing both the content and the metadata.

•    U: Upload - the user publishes their video content and provides directives for intended audience, content categories, etc.

Fig. 4.6. User activities during video search and contribution.

In the figure, only the primary flows are indicated; sites typically allow users to navigate among all the states at will. The flow also implies a traditional capture-edit-upload contribution process (capture is not shown here), which is typical of both user-contributed content and professionally produced Web video. While most video editing today takes place locally prior to contribution, many sites offer editing of content stored remotely on the server, so we have included this in the figure, where it is implied that there are Web applications supporting each state. This user interaction flow follows the classical model (which can be exploited to improve performance [Agi06]) but does not capture commonly employed concepts such as personalization based on user preferences, or the notion of a portal displaying popular or promotional content. The latter can fit the model as a special case of the browse state, which the user enters with a null query or as an initial state.

Further, many systems are constructed to support parallelism in the user interaction. For high-performance retrieval applications, or for immersive, entertainment-focused applications where full-screen video replay provides a lean-back, TV-like experience, one or more activities can take place simultaneously. In this scenario the video is “always on,” and the user may guide the thread of replay using queries or by browsing to other selections, as an alternative to pressing the “channel-up” button on a TV remote control.
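As a rough illustration, the primary flows of Fig. 4.6 can be written down as a small state-transition table. The sketch below is an assumption drawn from the description above (the state names match the bullets, but the exact transitions of any real system will differ, and most sites allow arbitrary jumps between states):

```python
# Hypothetical sketch of the interaction states in Fig. 4.6.
# The transitions below are assumptions based on the primary flows described
# in the text; real sites typically let users move between any of these states.

PRIMARY_FLOWS = {
    "Q": {"B"},        # Query   -> Browse the result list
    "B": {"V", "Q"},   # Browse  -> View a selection, or reformulate the query
    "V": {"A", "B"},   # View    -> Annotate, or return to browsing
    "A": {"B"},        # Annotate-> back to browsing
    "E": {"U"},        # Edit    -> Upload the composed video
    "U": {"B"},        # Upload  -> published content appears in browse/search
}

def is_primary_flow(session):
    """Check whether a sequence of states follows only the primary flows."""
    return all(nxt in PRIMARY_FLOWS.get(cur, set())
               for cur, nxt in zip(session, session[1:]))

# Example: a retrieval session (query, browse, view, annotate)
print(is_primary_flow(["Q", "B", "V", "A"]))  # True
print(is_primary_flow(["Q", "E"]))            # False under the primary flows
```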

Granularity of Search Results Representation

As we imagine the user navigating through this list of relevant content identified from a vast sea of video material, we can think in terms of a playlist or an edit decision list (EDL). In the former case, the system selects content in response to user queries and rank-orders it for replay to the user. In the latter, relevant segments are identified and selected, including “in” and “out” points.
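To make the distinction concrete, here is a minimal sketch; the type and field names are illustrative assumptions, not taken from any particular system. A playlist simply rank-orders whole assets, while an edit decision list also carries the “in” and “out” points of each selected segment:

```python
from dataclasses import dataclass
from typing import List

# A playlist rank-orders whole assets for replay.
Playlist = List[str]                     # e.g. ["cid_12", "cid_7", "cid_31"]

@dataclass
class EDLEntry:
    """One segment of an edit decision list (illustrative fields)."""
    content_id: str
    in_point: float    # seconds from the start of the asset
    out_point: float   # seconds from the start of the asset

# An EDL identifies the relevant segments, not just the assets.
edl: List[EDLEntry] = [
    EDLEntry("cid_12", in_point=15.0, out_point=42.5),
    EDLEntry("cid_7",  in_point=0.0,  out_point=8.0),
]
```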

Fig. 4.7. Levels of granularity for representing video search results.

As shown in Fig. 4.7, systems may supply additional information to represent the matches to a user query. The figure depicts a result set with content items of different lengths and a potential path (dashed line) for playing out the media to the user. The levels represented here (sketched in code after the list) are:

1.    Sets of content identifiers (CIDs) indicating which assets in the database match the user query.

2.    Lists of clips specified by offset and duration or in and out points (shown in grey) indicating the most relevant segments of each media file.

3.    Lists of “hits” or feature-level matches. For text, using word features, these are the words or phrases that match the query; in the case of a high-level image concept, these may be matching video frames. While represented in the figure as impulsive events, these “hits” may indeed have an implicit duration, albeit a small one, at the word or frame level.
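One hypothetical schema for these three levels is sketched below; the class and field names are assumptions chosen for illustration, not a standard representation used by any particular system:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Hit:
    """Level 3: a feature-level match (a word or a frame) with a tiny implicit duration."""
    time: float          # seconds into the asset
    duration: float      # e.g. one word or one frame
    feature: str         # e.g. "speech:'goal'" or "concept:face"

@dataclass
class Clip:
    """Level 2: a relevant segment given by offset and duration (or in/out points)."""
    offset: float
    duration: float
    hits: List[Hit] = field(default_factory=list)

@dataclass
class ResultItem:
    """Level 1: a content identifier (CID) that matched the query."""
    cid: str
    clips: List[Clip] = field(default_factory=list)

# A result set is then just a ranked list of ResultItem objects.
results: List[ResultItem] = [
    ResultItem("cid_i", clips=[
        Clip(offset=10.0, duration=20.0,
             hits=[Hit(time=12.3, duration=0.4, feature="speech:'goal'")]),
        Clip(offset=95.0, duration=5.0),
    ]),
]
```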

Note that the figure implies a binary (thresholded) decision as to what constitutes a match, but systems may also preserve a measure of the likelihood of a match at each level. Application designers must bear in mind that the accuracy of such measures may be difficult to determine. In our example, we may decide that the second clip in CIDi is of lower rank and should be omitted from the playback. We then have lists of identifiers and lists of temporal intervals, each with a measure of match value. Further, the portion of the query that generated the match can be represented, as can the portion of the content (e.g. the spatial coordinates of a region of interest in an image containing a face that matches a query). For practical reasons, many systems discard this detailed query-match information as quickly as possible during the stages of query processing and rendering of results.
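Preserving a score per clip makes the thresholding decision explicit; a minimal sketch follows, in which the score values and the 0.4 cut-off are purely illustrative assumptions:

```python
# Illustrative only: keep a match score per clip instead of a binary decision,
# then apply the threshold at playback time (the 0.4 cut-off is arbitrary).
clips = [
    {"cid": "cid_i", "offset": 10.0, "duration": 20.0, "score": 0.91},
    {"cid": "cid_i", "offset": 95.0, "duration": 5.0,  "score": 0.22},  # low-rank clip
]

PLAYBACK_THRESHOLD = 0.4
playable = [c for c in clips if c["score"] >= PLAYBACK_THRESHOLD]
# Only the first clip of cid_i survives; the second is omitted from playback,
# but its score remains available if the application wants to re-rank later.
```

Keeping the scores around, even if only briefly, lets the application defer the binary decision to the presentation layer, although, as noted above, many systems discard this detail early for practical reasons.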
