Fisher Vectors: Beyond Bag-of-Visual-Words Image Representations (Computer Vision, Imaging and Computer Graphics) Part 2

Image Categorization with FV

Figure 2 illustrates the schema of our Generic Visual Categorization system. In what follows we present the results of several categorization experiments with this system using FVs. In all experiments we used as low-level features either SIFT-like Orientation Histograms (ORH) alone or in combination with local RGB statistics (COL). These low-level features were extracted from 32 x 32 pixel patches on a regular grid (every 16 pixels) at five different scales. Both ORH and COL features were reduced to 50 or 64 dimensions with Principal Component Analysis (PCA). Unless stated otherwise, the ORH and COL features were merged at the score level.
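As a rough sketch, the dense sampling and PCA reduction described above could look as follows (the function names and the plain SVD-based PCA are ours for illustration, not the authors' implementation):

```python
import numpy as np

def dense_grid_centers(h, w, patch=32, step=16):
    """Centers of 32x32 patches sampled every 16 pixels on a regular grid."""
    ys = np.arange(patch // 2, h - patch // 2 + 1, step)
    xs = np.arange(patch // 2, w - patch // 2 + 1, step)
    return [(y, x) for y in ys for x in xs]

def pca_reduce(descriptors, dim=64):
    """Project local descriptors (n x d) onto their top `dim` principal axes."""
    mu = descriptors.mean(axis=0)
    centered = descriptors - mu
    # SVD of the centered data: rows of vt are the principal axes
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dim].T
```

In practice the same sampling would be repeated at the five scales, and the PCA basis would be estimated once on a training set of descriptors.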

To take into account the rough geometry of a scene we also used a spatial pyramid similar to the one proposed in [25]. However, instead of representing the regions at each layer with BOV histograms, we concatenate Fisher Vectors as follows. We repeatedly subdivide the image following the splitting strategy adopted by the winning systems of PASCAL VOC 2008 [26] and hence extract 8 Fisher Vectors per image: one for the whole image (1×1), three for the top, middle and bottom regions (1×3) and four for the quadrants (2×2). We power- and L2-normalize them independently and concatenate them either per layer, leading to three FVs (one per layer), or all eight together to get a single image representation per feature type (ORH or COL). We will refer to the latter representation as SP-FV (Spatial Pyramid of Fisher Vectors).
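The per-region normalization and concatenation can be sketched as follows (a minimal illustration; the power exponent of 0.5 is a common choice that we assume here):

```python
import numpy as np

def power_l2_normalize(fv, alpha=0.5):
    """Signed power normalization followed by L2 normalization of one FV."""
    fv = np.sign(fv) * np.abs(fv) ** alpha
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv

def spatial_pyramid_fv(region_fvs):
    """Concatenate the 8 independently normalized region FVs
    (1x1 whole image, 1x3 horizontal bands, 2x2 quadrants) into one SP-FV."""
    assert len(region_fvs) == 8
    return np.concatenate([power_l2_normalize(fv) for fv in region_fvs])
```

Normalizing each region independently before concatenation means every region contributes with comparable magnitude to the final representation.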


As we participated successfully in several image categorization and annotation challenges, we structure the experiments below accordingly, showing comparisons with the state of the art.

The Pascal VOC Challenge. The collection used by the Pascal VOC 2007 Challenge [7] contains around 5K training images provided with manual labels from 20 classes: person, vehicle classes (aeroplane, bicycle, boat, bus, car, motorbike, train), animal classes (bird, cat, cow, dog, horse, sheep) and diverse indoor objects (bottle, chair, dining table, potted plant, sofa, tv/monitor). For the 5K test set, the aim is to select from these 20 class labels the ones present in each test image; hence this is clearly a multi-class, multi-label classification task. In the challenge, the mean Average Precision (mAP) over the 20 classes was used to evaluate the performance of the different systems.
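For reference, mAP over such a multi-label test set can be computed as below (a plain, non-interpolated AP; note that the official VOC 2007 protocol uses an 11-point interpolated variant):

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one class: labels are 0/1 relevance, higher score = more confident."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    rel = np.asarray(labels)[order]
    hits = np.cumsum(rel)
    precisions = hits / (np.arange(len(rel)) + 1)
    # Average the precision at each position where a relevant item occurs
    return precisions[rel == 1].mean() if rel.sum() else 0.0

def mean_ap(score_matrix, label_matrix):
    """mAP over classes (one column per class)."""
    return float(np.mean([average_precision(score_matrix[:, c], label_matrix[:, c])
                          for c in range(score_matrix.shape[1])]))
```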

Table 1. Comparison of the proposed FV with state-of-the-art methods on PASCAL VOC 2007

| Method | mean AP (in %) |
|---|---|
| FV* (ORH) + linear [13] | 47.9 |
| FV (ORH) + linear [14] | 55.3 |
| SP-FV (ORH) + linear [14] | 58.3 |
| FV* (ORH + COL) + linear [14] | 45.9 |
| FV (ORH + COL) + linear [14] | 58.0 |
| SP-FV (ORH + COL) + linear [14] | 60.3 |
| FV* (ORH + COL) + non-linear at VOC07 | 55.7 |
| Best of VOC07 [27] | 59.4 |
| MKL [3] | 62.2 |
| non-lin SVMs + localization [28] | 63.5 |

Table 1 reports results on the 2007 dataset. We denote by FV* the Fisher Vector without power and L2 normalization. First, we can see that the power and L2 normalization yield a significant increase in classification accuracy. Considering spatial pyramids of FVs (SP-FV) allows for further improvements. If we compare SP-FV with the best results reported in the literature on this dataset, our system performs very well, considering that the two systems that achieve better accuracy use several low-level features and are much more complex. Indeed, [3] uses a sophisticated Multiple Kernel Learning (MKL) algorithm, and [28] combines the results of several non-linear classifiers with a costly sliding-window-based object localization system.

Finally, to test how the system performs if we increase the training data, for each of the 20 VOC classes we collected additional data from Flickr groups, up to 25K images per category. Adding more data helped significantly, as we achieved a mAP of 63.5%, similar to [28]. Note that in spite of the increased training set, the training cost remains reasonable (about 22h on a CPU of a 2.5GHz Xeon machine with 32GB of RAM) and the test time remains unchanged (190ms including feature extraction and SP-FV computation).

While this is a way to increase the accuracy at low cost, it is not sufficient on its own, as shown in Table 2.

Table 2. The SP-FV at Pascal VOC 2010

| Method | mAP (in %) |
|---|---|
| SP-FV | 61.2 |
| SP-FV + Flickr data | 68.3 |
| best at VOC10 | 73.8 |

Indeed, if we analyze the Pascal VOC 2010 results, we can see that adding the 1M Flickr images significantly improves our results. On the other hand, the latter is significantly outperformed by the best system using only the provided Pascal VOC data [7]. Nevertheless, the computational cost of their method is quite high, as it not only includes several non-linear classifiers, Multiple Kernel Learning and sliding-window-based object localization, but also multiple low-level image segmentations with MeanShift and Graph Cuts, which are known to be costly operations. Hence, it is difficult to assess the scalability of this method to very large datasets with many categories, such as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).

The ImageNet Large Scale Visual Recognition Challenge. The goal of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2010 competition is to assign an image to one of 1,000 “leaf” categories from the ImageNet dataset. In contrast to the Pascal Challenge, this is a multi-class, mono-label problem, as each image has a single label. To evaluate accuracy, two cost measures were considered (lower is better). The first one is a flat cost, which averages the number of wrongly labeled images (the cost is 1 for each wrongly labeled image). However, since the class labels come from a hierarchical taxonomy (WordNet), the aim was also to penalize a classification error that predicts a label ontologically far from the correct concept more than one that is close (e.g. for an image containing a lion, it is less harmful to predict the tiger label than truck or mailbox). Hence, a second, hierarchical cost was also considered, where the cost for a wrong label depends on the height of the closest least common ancestor in WordNet of the predicted and correct labels. The best results of the top four participants are shown in Table 3. Our system with SP-FV performed second out of the 11 participants.
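A toy illustration of such a hierarchical cost on a hand-made taxonomy (simplified: we count the steps from the true label up to the lowest common ancestor, which only approximates the official ILSVRC height definition):

```python
def ancestors(parent, node):
    """Path from a node up to the root (inclusive), given a child->parent map."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def hierarchical_cost(parent, true_label, predicted):
    """Steps from the true label up to the lowest common ancestor of the
    true and predicted labels (0 when the prediction is correct)."""
    up = ancestors(parent, true_label)
    pred_anc = set(ancestors(parent, predicted))
    for steps, node in enumerate(up):
        if node in pred_anc:
            return steps
    return len(up)  # disjoint taxonomies: maximal cost
```

With a taxonomy where lion and tiger share the parent big_cat while truck sits under vehicle, predicting tiger for a lion image costs less than predicting truck, as intended.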

The ImageClef Photo Annotation Task. The ImageClef visual concept detection and annotation task [29] challenged the participants with the MIRFLICKR-25000 image collection, which contains images selected based on their high interestingness rating. Our results are compared with the best visual-only and multi-modal runs of other participants in the table below.

Table 3. Best results of the top four participants at the ImageNet LSVRC 2010

| Method | flat cost | hierarchical cost |
|---|---|---|
| NEC-UIUC | 0.282 | 2.114 |
| XRCE | 0.336 | 2.555 |
| ISIL | 0.445 | 3.654 |
| UCI | 0.466 | 3.629 |

Table 4. FV-based runs at the ImageClef Photo Annotation Task, compared with the best visual-only [31] and multi-modal [32] runs of other participants (V: visual, T: textual)

| Run | Modality | mAP | EER | AUC | F-ex | OS |
|---|---|---|---|---|---|---|
| FV (late) | V | 39.0 | 25.8 | 80.9 | 62.7 | 63.8 |
| FV (early) | V | 38.9 | 26.3 | 80.5 | 63.9 | 64.5 |
| UVA [31] | V | 40.7 | 24.4 | 82.6 | 68.0 | 59.1 |
| FV + T (early) | V&T | 45.5 | 23.9 | 82.9 | 65.5 | 65.6 |
| FV + T (late) | V&T | 43.7 | 24.3 | 82.6 | 62.4 | 63.7 |
| MEIJE [32] | V&T | 32.6 | 35.9 | 63.7 | 57.2 | 36.6 |

The ImageClef Medical Image Modality Classification Sub-Task. Image modality is an important aspect of an image for medical retrieval. In user studies, clinicians have indicated that modality is one of the most important filters they would like to use to limit their search. Many image retrieval websites (Goldminer, Yottalook) allow users to limit search results to a particular modality. This modality is typically extracted from the caption and is often incorrect or missing. The aim of this challenge was to evaluate whether the image content itself can be used instead, or in combination with the textual information. Participants were provided with a training set of 2,000 images labeled with one of 8 modalities (CT, MR, XR, etc.) and had to classify a set of 2,000 test images into one of those modalities. They could use visual information, textual information (image captions), or both. Our SP-FV based approach was the best performing visual-only system, as shown in Table 5, and, combined with our textual run (T), also the best multi-modal system (see details in [33]).

Table 5. The SP-FV at ImageClef Medical Image Modality Classification Sub-Task. The best visual and mixed modality runs from other participants are also shown for comparison.

| Run | Modality | ACC |
|---|---|---|
| SP-FV | Visual | 0.87 |
| T | Textual | 0.90 |
| SP-FV + T | Mixed | 0.94 |
| UESTC | Visual | 0.82 |
| RitsMIP | Mixed | 0.93 |

Image Retrieval

In this section we present the results of several image retrieval experiments with uncompressed and compressed Fisher Vectors.

The IAPR TC12 Benchmark Photo Repository. The IAPR TC-12 photographic collection [34] consists of 20,000 still natural images taken at locations around the world, including pictures of different sports and actions, photographs of people, animals, cities, landscapes and many other aspects of contemporary life. Each image has an associated title, creation date, location, photographer name and a semantic description of its contents as determined by the photographer. The aim of the challenge was to retrieve relevant images for 60 query topics using either mono-modal features (visual or textual) or both modalities. In Table 6 we show the results on this dataset with pure visual information or combined with textual retrieval, either using late fusion or cross-modal similarities as described in [35]. We can see that the state-of-the-art results obtained with FV* in the challenge (these were the winning runs) were further outperformed by SP-FV, both in pure visual retrieval and when combined with textual information.

The ImageClef Wikipedia Retrieval Task. The ImageClef Wikipedia Retrieval task [34] consists of multilingual and multimedia retrieval. The collection contains over 237,000 Wikipedia images that cover diverse topics of interest. These images were extracted from Wikipedia pages in different languages, namely French, English and German, together with their captions.

Table 6. ImageClef Photo Retrieval with Fisher Vectors

| Method | MAP | P@20 |
|---|---|---|
| FV* (visual) | 0.18 | 0.326 |
| SP-FV (visual) | 0.22 | 0.382 |
| FV* + T (late fusion) | 0.348 | 0.45 |
| SP-FV + T (late fusion) | 0.352 | 0.46 |
| FV* + T (cross-modal) | 0.33 | 0.47 |
| SP-FV + T (cross-modal) | 0.35 | 0.51 |

In addition, participants were provided with the original Wikipedia pages in wikitext format. The task consisted in retrieving as many relevant images as possible from the aforementioned collection, given a textual query translated into the three languages and one or several query images. As the results from previous years have shown, pure visual systems obtain very poor results on this task. Wanting to test our FV based approach on such a difficult task, we also obtained poor results (mAP=5.5%), even if they were far better than those of the second best pure visual system (mAP=1.2%). However, we have shown that when we appropriately combine these results with text-based retrieval, we are able to boost the retrieval scores from mAP=20.4% to 27.9%. Our fusion strategy was simply to first filter the image scores by the text scores and then, after appropriate normalization, combine the filtered image scores with the text scores using late fusion (see details in [33]).
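This filter-then-fuse strategy can be sketched as follows (min-max normalization and equal weights are our assumptions for illustration, not necessarily the settings of [33]):

```python
def minmax(scores):
    """Rescale a {doc_id: score} dict to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {k: (v - lo) / span for k, v in scores.items()}

def filtered_late_fusion(image_scores, text_scores, w=0.5):
    """Keep only images also retrieved by text, then mix the normalized scores."""
    kept = {k: v for k, v in image_scores.items() if k in text_scores}
    img, txt = minmax(kept), minmax(text_scores)
    return {k: w * img[k] + (1 - w) * txt[k] for k in kept}
```

The filtering step discards visual false positives that the text ranker never retrieved, which is why the weak visual scores can still help rather than hurt.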

The Holiday Dataset Experiment with Binarized FVs. The Holiday dataset [36] contains 1,491 images of 500 scenes and objects, and the first image of each scene is used as a query. The retrieval accuracy is measured with mean Average Precision (mAP) over the 500 queries (one for each scene) using a leave-one-out cross-validation scheme. Figure 3 (left) compares our results with the recent state-of-the-art method of [19], which is based on compressed BOV vectors. We can see that our system with binarized FVs performs significantly better for a similar number of bits per representation (see further experiments and analyses in [18]).

We further ran a large-scale experiment, where the Holiday dataset was extended with a set of 1M “distractor” Flickr images (referred to as Flickr1M) made available by [36]. The same 500 Holiday images are queried, the 1M Flickr images are used as distractors, and we used recall@K for various vocabulary sizes N as the evaluation measure (as in [19]). Figure 3 (right) compares the two methods on this extended dataset. Again, we observe a very significant improvement in recall@K for a comparable number of bits.
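A minimal sketch of retrieval with sign-binarized FVs ranked by Hamming distance, and of the recall@K measure (the actual binarization scheme of [18] may differ in detail):

```python
import numpy as np

def binarize_fv(fv):
    """Keep only the sign of each FV dimension (1 bit per dimension)."""
    return (np.asarray(fv) > 0).astype(np.uint8)

def hamming_rank(query_bits, db_bits):
    """Rank database items by Hamming distance to the query (closest first)."""
    dists = (db_bits != query_bits).sum(axis=1)
    return np.argsort(dists, kind="stable")

def recall_at_k(ranking, relevant, k):
    """Fraction of the relevant items found in the top-k of the ranking."""
    return len(set(ranking[:k]) & set(relevant)) / len(relevant)
```

Hamming distances between bit vectors can be computed with cheap XOR/popcount operations, which is what makes such binarized representations attractive at the million-image scale.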

Other Applications

In the past few years we have successfully applied Fisher Vectors in several other applications. We mention here semantic image segmentation and intelligent image thumbnailing.

Semantic Image Segmentation. Semantic image segmentation consists in assigning each pixel of an image to one of a set of predefined semantic object categories. State-of-the-art semantic segmentation algorithms typically consist of three components: a local appearance model, a local consistency model and a global consistency model. These three components are generally integrated into a unified probabilistic framework.


Fig. 3. Comparison of the proposed binarized Fisher Vectors (bin FV) and the results of [19] (comp BOV). Left: the Holiday dataset, evaluated with average precision. Right: the Holiday dataset extended with Flickr1M, evaluated with recall.

While such a unified framework enables a joint estimation of the model parameters at training time and ensures a globally consistent labeling of the pixels at test time, it also comes at a high computational cost, e.g. [37, 38]. As described in [39], we proposed a simple approach to semantic segmentation where the three components are decoupled. The pipeline is quite similar to our Generic Visual Categorization system illustrated in Figure 2. Again, we use Fisher Vectors to represent patches and sum them to represent images. The main difference is that linear classifiers are trained both at the patch level and at the image level. The former allows us to score each local patch according to its class relevance; the posterior patch probabilities are then propagated to pixels, leading to class probability maps. The latter learns the global context of the object class and allows for early rejection of class probability maps for which the likelihood of the object's presence is low. Finally, the retained class probability maps are combined with a low-level segmentation to improve label consistency in homogeneous regions. This method was the best performing at the 2008 Pascal VOC Segmentation Challenge [7]. While recent methods (e.g. [40]) showed significantly better performance in the last two Pascal VOC Segmentation challenges, our method remains appealing for its simplicity and low computational cost.
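The propagation of patch-level scores to a pixel-wise class probability map can be sketched as follows (simple averaging over the patches covering each pixel; a simplification of the method in [39], written for illustration):

```python
import numpy as np

def patch_scores_to_pixel_map(h, w, patches, patch_size=32):
    """Average per-patch class probabilities into a dense pixel map.
    `patches` is a list of ((y, x) top-left corner, probability) pairs."""
    acc = np.zeros((h, w))
    cnt = np.zeros((h, w))
    for (y, x), p in patches:
        acc[y:y + patch_size, x:x + patch_size] += p
        cnt[y:y + patch_size, x:x + patch_size] += 1
    # Pixels covered by several overlapping patches get the mean probability
    return np.divide(acc, cnt, out=np.zeros_like(acc), where=cnt > 0)
```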

Intelligent Image Thumbnailing. Intelligent image thumbnailing consists in identifying one or more regions of interest in an input image: salient parts are aggregated into foreground regions, whereas redundant and non-informative pixels become part of the background. The range of applications where thumbnailing can be employed is broad, including traditional problems like image compression, visualization and summarization, and more recent applications like variable data printing or assisted content creation. In [41] we proposed a novel framework for visual saliency detection based on a simple principle: images sharing similar global visual appearance are likely to share similar saliency. Following this principle, for each training image the K most similar images are retrieved from an indexed database using Fisher Vectors. These images have strong labels (their patches were manually labeled as salient or not). We collect their FVs (using ORH and COL features as above) and average them to obtain a salient (foreground) model and a non-salient (background) model. Then, for each patch in the test image, a saliency score is computed based on its similarity to the foreground and background models. These scores are further propagated from patches (or sub-windows) to pixels, generating a smooth saliency map (see further details in [41]).
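The foreground/background modeling and the patch scoring can be sketched as follows (using cosine similarity and a score difference as our assumed similarity measure; [41] gives the actual formulation):

```python
import numpy as np

def build_models(neighbor_patch_fvs, neighbor_patch_labels):
    """Average the FVs of salient / non-salient patches from the K retrieved
    neighbors into a foreground model and a background model."""
    fvs = np.asarray(neighbor_patch_fvs, dtype=float)
    labels = np.asarray(neighbor_patch_labels, dtype=bool)
    return fvs[labels].mean(axis=0), fvs[~labels].mean(axis=0)

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def saliency_score(patch_fv, fg_model, bg_model):
    """Higher when the patch resembles the foreground model more."""
    return cosine(patch_fv, fg_model) - cosine(patch_fv, bg_model)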

Conclusions

In this paper we have shown several successful applications of Fisher Vectors. Indeed, we obtained state-of-the-art results with them in several image classification and retrieval challenges. From these results we can see that the power and L2 normalization and the spatial pyramid significantly boost categorization and retrieval performance. We further evaluated uncompressed and compressed FVs on large-scale datasets, showing that they are indeed suitable for such applications. In addition, we have shown that when textual data is available, we can take advantage of both modalities and obtain significant improvements over mono-modal systems. Finally, we briefly presented two further applications in which Fisher Vectors were successfully used.
