From a retrieval point of view, although the BoW model has shown promising results in desktop-based visual search over large-scale corpora, it also suffers from a semantic gap. The BoW model is limited by its homogeneous treatment of all image regions: features are extracted uniformly, and local features are given no particular emphasis. Consequently, a query image that carries no notion of priority can mislead the recognition algorithm. Hence, to obtain better retrieval results, local visual information needs to be utilized in a prioritized manner. The multi-touch screen and its user interaction on mobile devices offer such a platform, allowing users to select their regions of interest (ROIs) as prioritized information, with the surrounding context as secondary information.
From a mobile application perspective, visual search via image queries provides a powerful complement to conventional textual and vocal queries. Compared with text or voice queries for on-the-go information retrieval, there are many cases where a query can be expressed more naturally and conveniently through the mobile device's camera sensor (an unknown object or piece of text, an artwork, a shape or texture, and so on) [135]. In addition, mobile visual search has a promising future owing to the vital roles mobile devices play in our lives, from their original function of telephony, to prevalent information-sharing terminals, to hubs that accommodate tens of thousands of applications. While on the go, people use their phones as a personal concierge, discovering what is around them and deciding what to do. The mobile phone is therefore becoming a recommendation terminal customized for the individual, capable of recommending contextually relevant entities (local businesses such as a nearby restaurant or hotel) and simplifying the accomplishment of recommended tasks. As a result, it is important to understand user intent through the multi-modal nature of queries and the rich context available on the phone.
Motivated by the above observations, this chapter presents an interactive search-based visual recognition and contextual recommendation approach using the BoW model, targeting Internet-scale image collections. Smartphone hardware, such as the camera and the touch screen, is leveraged to let users express their ROI in the pictures they take. The visual query, together with the ROI specification, is then processed by an innovative contextual visual retrieval model that establishes a meaningful connection to database images and their associated rich textual information. Once visual recognition is accomplished, the textual information associated with the retrieved images is further analyzed to provide meaningful recommendations.
An actual system, code-named TapTell, is implemented based on the algorithms and methodologies described in Sect. 4.2. A natural user interaction is adopted to achieve the Tap action, for which three gestures are investigated (i.e., circle, line, and tap). It is concluded that the circle (also called the "O" gesture) is the most natural interaction for users, as it best captures the user's preference in selecting the target object. The BoW model and a novel context-embedded vocabulary tree approach are adopted. The algorithm incorporates both the ROI visual query and the context from the pixels surrounding the "O" region to search for similar images in a large-scale image dataset. Through this user interaction (i.e., the "O" gesture) and the BoW model with our innovative algorithm, standard visual recognition can be improved. The Tell action then builds on the recognition result, analyzing the associated textual information to deliver contextual recommendations.
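To make the retrieval idea more concrete, the following is a minimal sketch, not the chapter's actual implementation: the function names, the histogram-intersection similarity, and the weighting parameter alpha are all assumptions made here for illustration. It shows how an ROI query and its surrounding context could be combined under a flat BoW model: local descriptors from the "O" region and from the surrounding pixels are quantized into two separate histograms, and database images are ranked by a weighted combination of the two similarities, with the ROI given higher priority.

```python
import numpy as np

def bow_histogram(descriptors, vocabulary):
    """Quantize local descriptors to their nearest visual words and
    return an L1-normalized bag-of-words histogram."""
    if len(descriptors) == 0:
        return np.zeros(len(vocabulary))
    # Euclidean distance from every descriptor to every visual word.
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()

def rank_images(roi_desc, context_desc, vocabulary, database_hists, alpha=0.7):
    """Rank database images by a weighted combination of ROI similarity and
    surrounding-context similarity. alpha is an assumed weight that gives the
    ROI higher priority than the secondary context."""
    q_roi = bow_histogram(roi_desc, vocabulary)
    q_ctx = bow_histogram(context_desc, vocabulary)
    scores = []
    for idx, db_hist in enumerate(database_hists):
        # Histogram-intersection similarity for each component.
        sim_roi = np.minimum(q_roi, db_hist).sum()
        sim_ctx = np.minimum(q_ctx, db_hist).sum()
        scores.append((alpha * sim_roi + (1.0 - alpha) * sim_ctx, idx))
    return [idx for _, idx in sorted(scores, reverse=True)]
```

In the approach described above, the context is embedded into the vocabulary-tree quantization itself rather than fused as two flat histograms; this sketch is only meant to convey why distinguishing prioritized ROI information from secondary context information can improve the search.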