Table 4.2 MAP evaluation of the visual-based and description-based performance

MAP                  @0      @1      @2      @3      @4
Visual-based         96.08   53.06   37.61   29.60   24.59
Description-based    n/a     75.65   72.66   70.78   65.93
A 300-word text dictionary was built by extracting the most frequently used words in the image descriptions.
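As an illustration, the following is a minimal sketch of building such a frequency-based dictionary; the exact tokenization and word-selection rules are not described in the text, so this is an assumption:

from collections import Counter

def build_dictionary(descriptions, size=300):
    # Count word frequency across all image descriptions and keep the top `size`.
    counts = Counter(w for d in descriptions for w in d.lower().split())
    return {w for w, _ in counts.most_common(size)}

# Example: build_dictionary(["spicy beef noodle soup", "beef dumpling"])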
To reproduce a realistic restaurant scenario, dishes were printed out in a menu style with both text and images. We took pictures of the dishes as visual queries and attempted to find the duplicate/near-duplicate images in the dataset. It is assumed that the best match returned by visual recognition reflects the user's intent. This intent is carried by the associated metadata, which was quantized using the prepared 300-word dictionary. The quantized words were then used to produce a ranked list based on text similarity. The final step re-ranked the result list using GPS distance.
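The sketch below illustrates this two-stage search under stated assumptions: Jaccard overlap stands in for the unspecified text-similarity measure, the sample dictionary is a placeholder for the 300-word one, and the GPS re-ranking is applied to the text-ranked short list purely by haversine distance, since the actual combination rule is not given in the text.

from math import radians, sin, cos, asin, sqrt

# Hypothetical stand-in for the 300-word dictionary.
DICTIONARY = {"noodle", "beef", "spicy", "soup"}

def quantize(description):
    # Keep only the description words present in the dictionary.
    return {w for w in description.lower().split() if w in DICTIONARY}

def text_similarity(query_words, doc_words):
    # Jaccard overlap as a stand-in for the (unspecified) text measure.
    if not query_words or not doc_words:
        return 0.0
    return len(query_words & doc_words) / len(query_words | doc_words)

def gps_distance_km(a, b):
    # Haversine distance between two (lat, lon) pairs in kilometres.
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def second_stage_search(best_match, corpus, user_pos, top_k=10):
    # Rank by text similarity, then re-rank the short list by GPS distance.
    query_words = quantize(best_match["description"])
    ranked = sorted(corpus,
                    key=lambda d: text_similarity(query_words,
                                                  quantize(d["description"])),
                    reverse=True)[:top_k]
    return sorted(ranked, key=lambda d: gps_distance_km(user_pos, d["gps"]))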
Table 4.2 presents the MAP results for the initial visual query and for the newly formed text-description query after visual recognition. The table demonstrates that the description-based search performs much better than the visual-based search. This result is reasonable in the sense that text is a better descriptor than visual content once the ROI has been identified and linked with precise textual metadata. However, the merit of visual input is that it fills the niche when an individual lacks the language tools to express himself or herself articulately. During the initial visual search (@0), the visual-search result reaches a high precision of 96.08 %. Such accuracy provides a solid foundation for using the associated metadata as a description-based query in the second-stage search. In summary, once the visual query is mined accurately, the role of the search query shifts from visual content to text metadata for a better result.
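For reference, MAP here is the standard mean average precision over the query set; a minimal sketch of the computation follows (the relevance judgements behind Table 4.2 are not shown, so the example input is illustrative):

def average_precision(ranked_relevance):
    # AP for one query: `ranked_relevance` is a list of 0/1 judgements in
    # rank order; average the precision at each relevant hit.
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(all_queries):
    # MAP: mean of per-query AP values.
    return sum(average_precision(q) for q in all_queries) / len(all_queries)

# Two toy queries: relevant items at ranks 1 and 3, and at ranks 2 and 3.
print(mean_average_precision([[1, 0, 1], [0, 1, 1]]))  # ~0.708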
4.4.2.5 Time Complexity Analysis
The efficiency of TapTell's individual components is evaluated, and a detailed analysis is illustrated in Fig. 4.16. The total time spent on the server side is about 1.6 s, covering initialization, text-based search, visual-based search, and OCR-based recognition (the system also supports OCR when the ROI corresponds to a text region). Within the visual search, local SIFT descriptor extraction takes the most time, almost 1 s. Communication between the server and the client takes about 1.2 s over the wireless link in our experimental setup.
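As an aside, per-component server-side timings like those in Fig. 4.16 could be collected with simple wall-clock instrumentation; the sketch below is illustrative only, and the component names are assumptions rather than TapTell's actual code:

import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(component):
    # Accumulate wall-clock time per pipeline component.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[component] = timings.get(component, 0.0) \
            + time.perf_counter() - start

# Usage inside the server pipeline (component bodies elided):
with timed("sift_extraction"):
    time.sleep(0.01)  # placeholder for the ~1 s SIFT step
with timed("visual_search"):
    time.sleep(0.01)

for name, secs in timings.items():
    print(f"{name}: {secs:.3f} s")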
 