Table 4.2 MAP evaluation of the visual-based and description-based performance

MAP                  @0      @1      @2      @3      @4
Visual-based         96.08   53.06   37.61   29.60   24.59
Description-based    n/a     75.65   72.66   70.78   65.93
A 300-word text dictionary was built by extracting the most frequently used words in the image descriptions.
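As an illustration, the following is a minimal sketch of building such a frequency-based dictionary; the exact tokenization and word-selection rules are not described in the text, so this is an assumption:

from collections import Counter

def build_dictionary(descriptions, size=300):
    # Count word frequency across all image descriptions and keep the top `size`.
    counts = Counter(w for d in descriptions for w in d.lower().split())
    return {w for w, _ in counts.most_common(size)}

# Example: build_dictionary(["spicy beef noodle soup", "beef dumpling"])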
To reproduce a realistic restaurant scenario, dishes were printed out in a menu style with both text and images. We took pictures of the dishes as visual queries and attempted to find the duplicate/near-duplicate images in the dataset. It is assumed that the best match returned by visual recognition reflects the user's intent. This intent is carried by the associated metadata, which was quantized using the prepared 300-word dictionary. The quantized words were then used to produce a ranked list based on text similarity. The final step re-ranked the result list using GPS distance.
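The sketch below illustrates this two-stage search under stated assumptions: Jaccard overlap stands in for the unspecified text-similarity measure, the sample dictionary is a placeholder for the 300-word one, and the GPS re-ranking is applied to the text-ranked short list purely by haversine distance, since the actual combination rule is not given in the text.

from math import radians, sin, cos, asin, sqrt

# Hypothetical stand-in for the 300-word dictionary.
DICTIONARY = {"noodle", "beef", "spicy", "soup"}

def quantize(description):
    # Keep only the description words present in the dictionary.
    return {w for w in description.lower().split() if w in DICTIONARY}

def text_similarity(query_words, doc_words):
    # Jaccard overlap as a stand-in for the (unspecified) text measure.
    if not query_words or not doc_words:
        return 0.0
    return len(query_words & doc_words) / len(query_words | doc_words)

def gps_distance_km(a, b):
    # Haversine distance between two (lat, lon) pairs in kilometres.
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def second_stage_search(best_match, corpus, user_pos, top_k=10):
    # Rank by text similarity, then re-rank the short list by GPS distance.
    query_words = quantize(best_match["description"])
    ranked = sorted(corpus,
                    key=lambda d: text_similarity(query_words,
                                                  quantize(d["description"])),
                    reverse=True)[:top_k]
    return sorted(ranked, key=lambda d: gps_distance_km(user_pos, d["gps"]))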
Table 4.2 presents the MAP results for the initial visual query and for the newly formed text-description query after visual recognition. The table demonstrates that the description-based search performs much better than the visual-based search. This result is reasonable in the sense that text is a better descriptor than visual content once the ROI has been identified and linked with precise textual metadata. However, the merit of visual input is that it fills the niche when an individual lacks the language tools to express himself or herself articulately. During the initial visual search (@0), the visual-search result reaches a high precision of 96.08 %. Such accuracy provides a solid foundation for using the associated metadata as a description-based query in the second-stage search. In summary, once the visual query is mined accurately, the role of the search query shifts from visual content to text metadata for a better result.
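For reference, MAP here is the standard mean average precision over the query set; a minimal sketch of the computation follows (the relevance judgements behind Table 4.2 are not shown, so the example input is illustrative):

def average_precision(ranked_relevance):
    # AP for one query: `ranked_relevance` is a list of 0/1 judgements in
    # rank order; average the precision at each relevant hit.
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(all_queries):
    # MAP: mean of per-query AP values.
    return sum(average_precision(q) for q in all_queries) / len(all_queries)

# Two toy queries: relevant items at ranks 1 and 3, and at ranks 2 and 3.
print(mean_average_precision([[1, 0, 1], [0, 1, 1]]))  # ~0.708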
4.4.2.5 Time Complexity Analysis
The efficiency of TapTell's individual components is evaluated, and a detailed analysis is illustrated in Fig. 4.16. The total time spent on the server side is about 1.6 s, covering initialization, text-based search, visual-based search, and OCR-based recognition (the system also supports OCR when the ROI corresponds to a text region). Within the visual search, local SIFT descriptor extraction takes the most time, almost 1 s. Communication between the server and the client takes about 1.2 s over the wireless link in our experimental setup.
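As an aside, per-component server-side timings like those in Fig. 4.16 could be collected with simple wall-clock instrumentation; the sketch below is illustrative only, and the component names are assumptions rather than TapTell's actual code:

import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(component):
    # Accumulate wall-clock time per pipeline component.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[component] = timings.get(component, 0.0) \
            + time.perf_counter() - start

# Usage inside the server pipeline (component bodies elided):
with timed("sift_extraction"):
    time.sleep(0.01)  # placeholder for the ~1 s SIFT step
with timed("visual_search"):
    time.sleep(0.01)

for name, secs in timings.items():
    print(f"{name}: {secs:.3f} s")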
 