and returning them as search results. In summary, classification solutions focus
on feature ensembles, for instance, the histogram representation of each image,
whereas retrieval solutions focus on both feature ensembles and individual local
descriptor matches.
Csurka et al. proposed a BoW model-based algorithm for visual image classification
over seven different classes, including faces, buildings, trees, cars, phones,
bikes and books [115]. SIFT features are used as local descriptors, and Naïve Bayes
and non-linear supervised support vector machines (SVMs) are used as classifiers.
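The following Python sketch illustrates the general BoW classification pipeline described above: local descriptors are quantized against a learned codebook, each image becomes a codeword histogram, and a non-linear SVM is trained on the histograms. It is only a minimal illustration, assuming scikit-learn is available; the helper names, the toy data, and the parameter values are placeholders, and real descriptor extraction (e.g. SIFT) is assumed to happen elsewhere.

```python
# Minimal BoW classification sketch: codebook -> histograms -> SVM.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def build_codebook(all_descriptors, k=50, seed=0):
    """Cluster pooled local descriptors into k visual words."""
    return KMeans(n_clusters=k, random_state=seed).fit(all_descriptors)

def bow_histogram(descriptors, codebook):
    """Map one image's descriptors to a normalized codeword histogram."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Toy stand-in data: 20 "images", each with 50 random 128-D descriptors
# (a real pipeline would use SIFT descriptors extracted from the images).
rng = np.random.default_rng(0)
images = [rng.normal(size=(50, 128)) for _ in range(20)]
labels = np.array([i % 2 for i in range(20)])    # two toy classes

codebook = build_codebook(np.vstack(images), k=50)
X = np.array([bow_histogram(d, codebook) for d in images])

clf = SVC(kernel="rbf").fit(X, labels)           # non-linear SVM on histograms
print(clf.predict(X[:3]))
```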
Deng et al. proposed a database called “ImageNet”, which associates images with a
large-scale ontology based on the WordNet structure [116, 117]. Currently, about
nine million images are indexed, and this number is still growing. Among the
benchmark measurements and comparisons, a spatial pyramid-based histogram of SIFT
local codewords with SVM classifiers provides the best performance.
Zhou et al. proposed a method that incorporates vector coding to achieve scalable
image classification [142]. They adopted vector quantization coding on local SIFT
descriptors to map the features into a high-dimensional sparse vector. Spatial
information of local regions in each image is taken into account through a step
called spatial pooling. Finally, linear SVMs are used to classify the image
representations obtained from the spatial pooling.
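As a rough illustration of this idea (not the authors' exact coding scheme), the sketch below hard-quantizes descriptors against a codebook and pools per-codeword counts within the cells of a 2x2 spatial grid, concatenating the cell histograms into one vector suitable for a linear SVM. All names, grid sizes, and data here are illustrative assumptions.

```python
# Sketch: hard vector quantization + spatial pooling over a 2x2 grid.
import numpy as np

def vq_code(descriptors, codebook):
    """Hard-assign each local descriptor to its nearest codeword index."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def spatial_pool(codes, positions, image_size, k, grid=(2, 2)):
    """Accumulate per-cell codeword histograms over a spatial grid."""
    gx, gy = grid
    w, h = image_size
    pooled = np.zeros((gx * gy, k))
    cell_x = np.minimum((positions[:, 0] * gx // w).astype(int), gx - 1)
    cell_y = np.minimum((positions[:, 1] * gy // h).astype(int), gy - 1)
    for cell, code in zip(cell_x * gy + cell_y, codes):
        pooled[cell, code] += 1
    return pooled.reshape(-1)        # concatenated per-cell histograms

# Toy example: random descriptors with keypoint positions in a 640x480 image.
rng = np.random.default_rng(1)
codebook = rng.normal(size=(64, 128))     # stand-in for a learned codebook
desc = rng.normal(size=(200, 128))
pos = rng.uniform([0.0, 0.0], [640.0, 480.0], size=(200, 2))

feature = spatial_pool(vq_code(desc, codebook), pos, (640, 480), k=64)
print(feature.shape)                      # (2 * 2 * 64,) == (256,)
```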
Although non-linear SVM classifiers perform well, they scale poorly to large
datasets because of their computational complexity. Perronnin et al. proposed
several methods to improve on non-linear SVMs, including square-rooting BoW
vectors, kernel-PCA-based embedding for additive kernels, and embeddings for
non-additive kernels [128, 129]. In particular, an algorithm using Fisher kernels
was proposed to build gradient vectors from the features, so that linear SVMs
could replace the non-linear ones as computationally cheaper classifiers [127].
Hence, the scalability issue was alleviated.
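To make one of these tricks concrete, the short sketch below demonstrates the square-rooting of BoW vectors mentioned above: after L1 normalization, the dot product of square-rooted histograms equals the Hellinger kernel, so a linear SVM on the transformed vectors behaves like an additive-kernel SVM on the originals. The function name and toy histograms are illustrative.

```python
# Square-rooting BoW histograms: a linear kernel on the transformed
# vectors coincides with the (non-linear) Hellinger kernel.
import numpy as np

def sqrt_embed(hist):
    """L1-normalize a BoW histogram, then take elementwise square roots."""
    hist = np.asarray(hist, dtype=float)
    hist = hist / max(hist.sum(), 1e-12)
    return np.sqrt(hist)

h1 = np.array([4, 0, 1, 5], dtype=float)
h2 = np.array([2, 3, 1, 2], dtype=float)

linear_on_sqrt = sqrt_embed(h1) @ sqrt_embed(h2)
hellinger = np.sum(np.sqrt(h1 / h1.sum()) * np.sqrt(h2 / h2.sum()))
print(np.isclose(linear_on_sqrt, hellinger))    # True: the kernels coincide
```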
Sivic and Zisserman proposed a video scene retrieval system called Video
Google [264]. The goal is to retrieve similar objects and scenes and to localize
their occurrences in a video. MSER feature detection and SIFT feature description
are used to extract local descriptors. The visual vocabulary is built by K-means
clustering. A term frequency-inverse document frequency (tf-idf) text retrieval
algorithm is used to match each visual word.
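The sketch below shows the text-retrieval analogy in miniature: each frame or image is treated as a document of visual words, its word counts are weighted by term frequency and inverse document frequency, and candidates are ranked by cosine similarity. This is a generic tf-idf illustration, not the Video Google implementation; the toy counts and helper names are assumptions.

```python
# tf-idf weighting and cosine ranking over visual-word histograms.
import numpy as np

def tfidf_matrix(word_counts):
    """word_counts: (n_docs, vocab_size) raw visual-word counts."""
    counts = np.asarray(word_counts, dtype=float)
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    df = (counts > 0).sum(axis=0)                      # docs containing each word
    idf = np.log(counts.shape[0] / np.maximum(df, 1.0))
    return tf * idf

def cosine_rank(query_vec, doc_vecs):
    """Rank documents by cosine similarity to the query vector."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    d = doc_vecs / (np.linalg.norm(doc_vecs, axis=1, keepdims=True) + 1e-12)
    return np.argsort(-(d @ q))

# Toy corpus: 4 "frames" over a vocabulary of 6 visual words.
counts = np.array([[3, 0, 1, 0, 2, 0],
                   [0, 4, 0, 1, 0, 0],
                   [2, 0, 2, 0, 3, 1],
                   [0, 1, 0, 5, 0, 0]])
weighted = tfidf_matrix(counts)
print(cosine_rank(weighted[0], weighted))   # frame 0 ranks itself first
```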
Nistér and Stewénius proposed an efficient and scalable visual vocabulary tree,
making it possible to build a large-scale retrieval system using the BoW model
[126]. The method adopts hierarchical K-means clustering to speed up both codebook
generation and retrieval. The idea is that a query visual word does not
necessarily need to be compared against the full codebook; rather, a subset of
the codebook (a branch of the hierarchical K-means tree) is sufficient. This
allows the codebook to scale from a few thousand to hundreds of thousands, or
even millions, of visual words without much computational penalty. Although there
is no automatic mechanism to determine the proper codebook size, in general a
larger vocabulary described by the codebook leads to a better description of the
query image with less quantization error [258].
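A minimal sketch of this idea, assuming scikit-learn is available, is shown below: the vocabulary is built by recursive K-means with a small branching factor, and a query descriptor is quantized by descending the tree, comparing against only `branch` centers per level instead of the whole codebook. The branching factor, depth, and toy data are illustrative and not taken from the paper.

```python
# Hierarchical K-means vocabulary tree: build by recursive clustering,
# quantize by descending one branch per level.
import numpy as np
from sklearn.cluster import KMeans

def build_tree(descriptors, branch=4, depth=3, seed=0):
    """Recursively cluster descriptors into a vocabulary tree node."""
    if depth == 0 or len(descriptors) < branch:
        return {"centers": None, "children": []}        # leaf node
    km = KMeans(n_clusters=branch, random_state=seed).fit(descriptors)
    children = [build_tree(descriptors[km.labels_ == i], branch, depth - 1, seed)
                for i in range(branch)]
    return {"centers": km.cluster_centers_, "children": children}

def quantize(tree, descriptor, path=()):
    """Return the branch index chosen at each level (identifies a leaf word)."""
    if tree["centers"] is None:
        return path
    d2 = ((tree["centers"] - descriptor) ** 2).sum(axis=1)
    best = int(d2.argmin())
    return quantize(tree["children"][best], descriptor, path + (best,))

rng = np.random.default_rng(0)
train_desc = rng.normal(size=(5000, 32))            # toy 32-D descriptors
tree = build_tree(train_desc, branch=4, depth=3)    # up to 4**3 = 64 leaves
print(quantize(tree, rng.normal(size=32)))          # e.g. (2, 0, 3)
```

With branching factor b and depth L, quantizing a descriptor costs roughly b*L distance comparisons instead of b**L, which is why the codebook can grow to very large sizes without a proportional increase in lookup cost.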