and returning them as search results. In summary, classification solutions focus
on feature ensembles, for instance, the histogram representation of each image,
whereas retrieval solutions focus on both feature ensembles and individual local
descriptor matches.
Csurka et al. proposed a BoW model-based algorithm for visual image classification
over seven different classes, including faces, buildings, trees, cars, phones,
bikes and books [115]. SIFT features are used as local descriptors, and Naïve Bayes
and non-linear supervised support vector machines (SVMs) are used as classifiers.
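The following Python sketch illustrates the general BoW classification pipeline described above: local descriptors are quantized against a learned codebook, each image becomes a codeword histogram, and a non-linear SVM is trained on the histograms. It is only a minimal illustration, assuming scikit-learn is available; the helper names, the toy data, and the parameter values are placeholders, and real descriptor extraction (e.g. SIFT) is assumed to happen elsewhere.

```python
# Minimal BoW classification sketch: codebook -> histograms -> SVM.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def build_codebook(all_descriptors, k=50, seed=0):
    """Cluster pooled local descriptors into k visual words."""
    return KMeans(n_clusters=k, random_state=seed).fit(all_descriptors)

def bow_histogram(descriptors, codebook):
    """Map one image's descriptors to a normalized codeword histogram."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Toy stand-in data: 20 "images", each with 50 random 128-D descriptors
# (a real pipeline would use SIFT descriptors extracted from the images).
rng = np.random.default_rng(0)
images = [rng.normal(size=(50, 128)) for _ in range(20)]
labels = np.array([i % 2 for i in range(20)])    # two toy classes

codebook = build_codebook(np.vstack(images), k=50)
X = np.array([bow_histogram(d, codebook) for d in images])

clf = SVC(kernel="rbf").fit(X, labels)           # non-linear SVM on histograms
print(clf.predict(X[:3]))
```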
Deng et al. proposed a database called “ImageNet”, which associates images with a
large-scale ontology based on the WordNet structure [116, 117]. Currently, about
nine million images are indexed, and this number is still growing. Among the
benchmark measurements and comparisons, a spatial pyramid-based histogram of SIFT
local codewords with SVM classifiers provides the best performance.
Zhou et al. proposed a method that incorporates vector coding to achieve scalable
image classification [142]. They adopted vector quantization coding on local SIFT
descriptors to map the features into a high-dimensional sparse vector. Spatial
information of local regions in each image is taken into account through a step
called spatial pooling. Finally, linear SVMs are used to classify the image
representations obtained from the spatial pooling.
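As a rough illustration of this idea (not the authors' exact coding scheme), the sketch below hard-quantizes descriptors against a codebook and pools per-codeword counts within the cells of a 2x2 spatial grid, concatenating the cell histograms into one vector suitable for a linear SVM. All names, grid sizes, and data here are illustrative assumptions.

```python
# Sketch: hard vector quantization + spatial pooling over a 2x2 grid.
import numpy as np

def vq_code(descriptors, codebook):
    """Hard-assign each local descriptor to its nearest codeword index."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def spatial_pool(codes, positions, image_size, k, grid=(2, 2)):
    """Accumulate per-cell codeword histograms over a spatial grid."""
    gx, gy = grid
    w, h = image_size
    pooled = np.zeros((gx * gy, k))
    cell_x = np.minimum((positions[:, 0] * gx // w).astype(int), gx - 1)
    cell_y = np.minimum((positions[:, 1] * gy // h).astype(int), gy - 1)
    for cell, code in zip(cell_x * gy + cell_y, codes):
        pooled[cell, code] += 1
    return pooled.reshape(-1)        # concatenated per-cell histograms

# Toy example: random descriptors with keypoint positions in a 640x480 image.
rng = np.random.default_rng(1)
codebook = rng.normal(size=(64, 128))     # stand-in for a learned codebook
desc = rng.normal(size=(200, 128))
pos = rng.uniform([0.0, 0.0], [640.0, 480.0], size=(200, 2))

feature = spatial_pool(vq_code(desc, codebook), pos, (640, 480), k=64)
print(feature.shape)                      # (2 * 2 * 64,) == (256,)
```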
Although non-linear SVM classifiers perform well, they scale poorly to large
datasets because of their computational complexity. Perronnin et al. proposed
several methods to improve on non-linear SVMs, including square-rooting BoW
vectors, kernel-PCA-based embedding for additive kernels, and embeddings for
non-additive kernels [128, 129]. In particular, an algorithm using Fisher kernels
was proposed to build gradient vectors from the features, so that linear SVMs
could replace the non-linear ones as computationally cheaper classifiers [127].
Hence, the scalability issue was alleviated.
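To make one of these tricks concrete, the short sketch below demonstrates the square-rooting of BoW vectors mentioned above: after L1 normalization, the dot product of square-rooted histograms equals the Hellinger kernel, so a linear SVM on the transformed vectors behaves like an additive-kernel SVM on the originals. The function name and toy histograms are illustrative.

```python
# Square-rooting BoW histograms: a linear kernel on the transformed
# vectors coincides with the (non-linear) Hellinger kernel.
import numpy as np

def sqrt_embed(hist):
    """L1-normalize a BoW histogram, then take elementwise square roots."""
    hist = np.asarray(hist, dtype=float)
    hist = hist / max(hist.sum(), 1e-12)
    return np.sqrt(hist)

h1 = np.array([4, 0, 1, 5], dtype=float)
h2 = np.array([2, 3, 1, 2], dtype=float)

linear_on_sqrt = sqrt_embed(h1) @ sqrt_embed(h2)
hellinger = np.sum(np.sqrt(h1 / h1.sum()) * np.sqrt(h2 / h2.sum()))
print(np.isclose(linear_on_sqrt, hellinger))    # True: the kernels coincide
```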
Sivic and Zisserman proposed a video scene retrieval system called Video
Google [264]. The goal is to retrieve similar objects and scenes and to localize
their occurrences in a video. MSER feature detection and SIFT feature description
are used to extract local descriptors. The visual vocabulary is built by K-means
clustering. A term frequency-inverse document frequency (tf-idf) text retrieval
algorithm is used to match each visual word.
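The sketch below shows the text-retrieval analogy in miniature: each frame or image is treated as a document of visual words, its word counts are weighted by term frequency and inverse document frequency, and candidates are ranked by cosine similarity. This is a generic tf-idf illustration, not the Video Google implementation; the toy counts and helper names are assumptions.

```python
# tf-idf weighting and cosine ranking over visual-word histograms.
import numpy as np

def tfidf_matrix(word_counts):
    """word_counts: (n_docs, vocab_size) raw visual-word counts."""
    counts = np.asarray(word_counts, dtype=float)
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    df = (counts > 0).sum(axis=0)                      # docs containing each word
    idf = np.log(counts.shape[0] / np.maximum(df, 1.0))
    return tf * idf

def cosine_rank(query_vec, doc_vecs):
    """Rank documents by cosine similarity to the query vector."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    d = doc_vecs / (np.linalg.norm(doc_vecs, axis=1, keepdims=True) + 1e-12)
    return np.argsort(-(d @ q))

# Toy corpus: 4 "frames" over a vocabulary of 6 visual words.
counts = np.array([[3, 0, 1, 0, 2, 0],
                   [0, 4, 0, 1, 0, 0],
                   [2, 0, 2, 0, 3, 1],
                   [0, 1, 0, 5, 0, 0]])
weighted = tfidf_matrix(counts)
print(cosine_rank(weighted[0], weighted))   # frame 0 ranks itself first
```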
Nistér and Stewénius proposed an efficient and scalable visual vocabulary tree,
making it possible to build a large-scale retrieval system using the BoW model
[126]. The method adopts hierarchical K-means clustering to speed up both codebook
generation and retrieval. The idea is that a query visual word does not
necessarily need to be compared against the full codebook; rather, a subset of
the codebook (a branch of the hierarchical K-means tree) is sufficient. This
allows the codebook to scale from a few thousand to hundreds of thousands, or
even millions, of visual words without much computational penalty. Although there
is no automatic mechanism to determine the proper codebook size, in general a
larger vocabulary described by the codebook leads to a better description of the
query image with less quantization error [258].
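A minimal sketch of this idea, assuming scikit-learn is available, is shown below: the vocabulary is built by recursive K-means with a small branching factor, and a query descriptor is quantized by descending the tree, comparing against only `branch` centers per level instead of the whole codebook. The branching factor, depth, and toy data are illustrative and not taken from the paper.

```python
# Hierarchical K-means vocabulary tree: build by recursive clustering,
# quantize by descending one branch per level.
import numpy as np
from sklearn.cluster import KMeans

def build_tree(descriptors, branch=4, depth=3, seed=0):
    """Recursively cluster descriptors into a vocabulary tree node."""
    if depth == 0 or len(descriptors) < branch:
        return {"centers": None, "children": []}        # leaf node
    km = KMeans(n_clusters=branch, random_state=seed).fit(descriptors)
    children = [build_tree(descriptors[km.labels_ == i], branch, depth - 1, seed)
                for i in range(branch)]
    return {"centers": km.cluster_centers_, "children": children}

def quantize(tree, descriptor, path=()):
    """Return the branch index chosen at each level (identifies a leaf word)."""
    if tree["centers"] is None:
        return path
    d2 = ((tree["centers"] - descriptor) ** 2).sum(axis=1)
    best = int(d2.argmin())
    return quantize(tree["children"][best], descriptor, path + (best,))

rng = np.random.default_rng(0)
train_desc = rng.normal(size=(5000, 32))            # toy 32-D descriptors
tree = build_tree(train_desc, branch=4, depth=3)    # up to 4**3 = 64 leaves
print(quantize(tree, rng.normal(size=32)))          # e.g. (2, 0, 3)
```

With branching factor b and depth L, quantizing a descriptor costs roughly b*L distance comparisons instead of b**L, which is why the codebook can grow to very large sizes without a proportional increase in lookup cost.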