Local Representation of Facial Features (Face Image Modeling and Representation) (Face Recognition) Part 3

Constructing Gabor Features

Gabor features are constructed by convolution of an input image ξ(x, y) with the filter ψ(x, y; f, θ) in (4.9):

$$
r_\xi(x, y; f, \theta) = \big(\psi(\cdot,\cdot; f, \theta) * \xi\big)(x, y) = \iint \psi(x - x_\tau,\, y - y_\tau; f, \theta)\, \xi(x_\tau, y_\tau)\, dx_\tau\, dy_\tau
$$

The convolution produces a response image r_ξ of the same size. A single filter alone rarely suffices, so the response images are computed for a “bank” of filters tuned to various frequencies and orientations. The frequencies are typically drawn from a logarithmic scale similar to wavelets [15]:

$$
f_k = c^{-k} f_{\max}, \qquad k = 0, 1, \dots, m - 1,
$$


where f_max is the maximum frequency (the smallest scale) and c is the frequency scaling factor. Useful values for c include c = 2 for octave spacing and c = √2 for half-octave spacing. The filter orientations are spaced uniformly:
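As a concrete illustration of the frequency spacing, the following Python/NumPy lines generate such a frequency bank (a minimal sketch; the values of f_max, c and the number of scales m are placeholders, not values from the chapter):

```python
import numpy as np

# Hypothetical parameters: maximum frequency, scaling factor and number of scales.
f_max = 0.25          # highest frequency (smallest scale)
c = np.sqrt(2.0)      # half-octave spacing; use c = 2.0 for octave spacing
m = 5                 # number of frequencies in the bank

# f_k = f_max / c**k, k = 0, ..., m-1: logarithmically spaced frequencies
frequencies = f_max / c ** np.arange(m)
print(frequencies)    # e.g. [0.25, 0.177, 0.125, 0.088, 0.0625]
```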

$$
\theta_k = \frac{k \pi}{n}, \qquad k = 0, 1, \dots, n - 1.
$$

For real signals, the responses on [π, 2π[ are complex conjugates of the responses on [0, π[, and therefore only the responses for the half plane are needed:

$$
G = \begin{bmatrix}
r_\xi(x_0, y_0; f_0, \theta_0) & r_\xi(x_0, y_0; f_0, \theta_1) & \cdots & r_\xi(x_0, y_0; f_0, \theta_{n-1}) \\
r_\xi(x_0, y_0; f_1, \theta_0) & r_\xi(x_0, y_0; f_1, \theta_1) & \cdots & r_\xi(x_0, y_0; f_1, \theta_{n-1}) \\
\vdots & \vdots & \ddots & \vdots \\
r_\xi(x_0, y_0; f_{m-1}, \theta_0) & r_\xi(x_0, y_0; f_{m-1}, \theta_1) & \cdots & r_\xi(x_0, y_0; f_{m-1}, \theta_{n-1})
\end{bmatrix} \tag{4.15}
$$

In (4.15) the columns denote responses over different orientations and the rows responses over different frequencies (scales). This structure is called the “simple Gabor feature space”, formally defined in [43], later revised in [39] and utilized in face detection in [32]. A significant simplification made in the proposed feature space is the use of only one spatial location (x0, y0) to represent an object. The assumption is justified if the objects are simple or if they are distinguishable from each other in the feature space. This is not the case with, for example, the human face, but it seems to hold between salient sub-parts, such as nostrils, eyes, mouth corners, etc. The filters in one location, tuned to various frequencies and orientations, span a sub-space whose reconstruction accuracy decreases with distance from the filter origin. This is demonstrated in Fig. 4.12, where an original face is reconstructed using filter responses from 10 locations.
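The following Python sketch illustrates how such a response matrix could be computed at a single location. It assumes one common complex Gabor parameterization; the sharpness parameters gamma and eta, the kernel size, and the normalization are assumptions and may differ from the filter defined in (4.9):

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(f, theta, gamma=1.0, eta=1.0, size=31):
    """Complex 2D Gabor filter in one common normalized parameterization."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)    # coordinates rotated to theta
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return (f ** 2 / (np.pi * gamma * eta)) \
        * np.exp(-f ** 2 * ((xr / gamma) ** 2 + (yr / eta) ** 2)) \
        * np.exp(1j * 2 * np.pi * f * xr)

def response_matrix(image, x0, y0, frequencies, orientations):
    """m x n matrix of complex filter responses at the single location (x0, y0):
    rows correspond to frequencies (scales), columns to orientations, as in (4.15)."""
    G = np.empty((len(frequencies), len(orientations)), dtype=complex)
    for i, f in enumerate(frequencies):
        for k, theta in enumerate(orientations):
            # Full convolution just to read one pixel; fine for a sketch.
            response = fftconvolve(image, gabor_kernel(f, theta), mode='same')
            G[i, k] = response[y0, x0]
    return G

# Example with placeholder parameters: five frequencies, four orientations.
image = np.random.rand(128, 128)                # stand-in for a gray-level face image
freqs = 0.25 / np.sqrt(2.0) ** np.arange(5)
thetas = np.arange(4) * np.pi / 4               # uniformly spaced on [0, pi[
G = response_matrix(image, 64, 64, freqs, thetas)
print(G.shape)                                  # (5, 4)
```

In practice the filtering would of course be restricted to the locations of interest instead of convolving the whole image for each filter.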


Fig. 4.12 Reconstruction from responses at 10 different locations (four orientations and five frequencies): a original; b reconstruction

Operations for rotation and scale invariant search of objects can be defined on the response matrix: a column-wise circular shift corresponds to a rotation of the object around the location (x0, y0), and a row-wise shift corresponds to scaling the object by the factor c [43]. Illumination invariance can be achieved by normalizing the feature matrix [43].
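A small NumPy sketch of these matrix operations follows; it is illustrative only, and the simple energy normalization used for illumination invariance is an assumption rather than the exact normalization of [43]:

```python
import numpy as np

# G is the m x n response matrix of (4.15): rows = frequencies, cols = orientations.
G = np.arange(20, dtype=complex).reshape(5, 4)   # placeholder matrix

# Rotating the object around (x0, y0) by one orientation step corresponds to a
# column-wise circular shift of the matrix:
G_rotated = np.roll(G, shift=1, axis=1)

# Scaling the object by the factor c corresponds to a row-wise shift; note that
# np.roll wraps around, whereas responses shifted outside the covered scale
# range are in fact lost, so only shifts within that range are meaningful:
G_scaled = np.roll(G, shift=1, axis=0)

# Illumination invariance by normalizing the feature matrix (simple energy norm):
G_normalized = G / np.linalg.norm(G)
```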

Learning Facial Features

In principle, Gabor features can be used similarly to LBPs or any other local features: the filter responses are computed for various frequencies and orientations, and a descriptor is formed from the responses inside one or multiple fixed-size windows, as illustrated in Fig. 4.6. For example, Zou et al. [103] proposed a face recognition method using such a region descriptor and reported state-of-the-art results for the FERET database: fb: 99.5%, fc: 99.5%, dup I: 85.0% and dup II: 79.5%. The Gabor face descriptor is easy to implement, but for completeness, in this section we concentrate on local facial features and utilize the simple feature matrix to represent and learn them.

We assume an annotated training set of face images. The annotations are, for example, the centroids of selected facial landmarks (see Fig. 4.12(a)). Any classifier or pattern recognition method can be used to learn the facial representations from the extracted Gabor features. A completely statistical approach, however, possesses superior properties compared to other methods [37]: the decision making has an interpretable basis from which the most probable option can be chosen, and a within-class comparison can be performed using statistical hypothesis testing [66]. In statistical approaches, a class is typically represented by a class-conditional probability density function (pdf) over the feature space. It should be noted that finding a proper pdf estimate has a crucial impact on the success of facial feature detection. Typically, the form of the pdf is somehow restricted and the estimation reduces to fitting the restricted model to the observed features. Often a simple model, such as a single Gaussian distribution (normally distributed random variable), can represent the features efficiently, but a more general model, such as a finite mixture model, must be used to approximate more complex pdfs. We adopt the method in [37], where Gaussian mixture models represent the facial-feature-conditional pdfs of the Gabor feature matrix.

The multiresolution Gabor feature at a single location can be converted from the matrix in (4.15) to a feature vector

$$
\mathbf{g} = \big[\, r_\xi(f_0, \theta_0),\; r_\xi(f_0, \theta_1),\; \dots,\; r_\xi(f_0, \theta_{n-1}),\; r_\xi(f_1, \theta_0),\; \dots,\; r_\xi(f_{m-1}, \theta_{n-1}) \,\big]^{\mathsf T} \tag{4.16}
$$

where the fixed location (x0, y0) is omitted from the arguments for brevity.

Since the feature vector is complex valued, the complex Gaussian distribution function needs to be used,

$$
\mathcal{N}_{\mathbb C}(\mathbf{g}; \boldsymbol\mu, \Sigma) = \frac{1}{\pi^{D} |\Sigma|} \exp\!\big( -(\mathbf{g} - \boldsymbol\mu)^{\mathsf H} \Sigma^{-1} (\mathbf{g} - \boldsymbol\mu) \big) \tag{4.17}
$$

where μ denotes the mean vector, Σ the covariance matrix, D the dimensionality of g, and H the conjugate transpose. It should be noted that the pure complex form of the Gaussian in (4.17) provides computational stability in the parameter estimation compared to concatenating the real and imaginary parts into a real vector, since the dimensionality of the problem doubles in the latter case [66]. Now, a Gaussian mixture model (GMM) probability density function can be defined as a weighted sum of Gaussians

$$
p(\mathbf{g}; \Theta) = \sum_{c=1}^{C} \alpha_c\, \mathcal{N}_{\mathbb C}(\mathbf{g}; \boldsymbol\mu_c, \Sigma_c) \tag{4.18}
$$

where α_c is the weight of the cth component. The weight can be interpreted as the a priori probability that a value of the random variable is generated by the cth source, and thus 0 ≤ α_c ≤ 1 and Σ_{c=1..C} α_c = 1. The Gaussian mixture model probability density function can be completely defined by the parameter list

$$
\Theta = \{\alpha_1, \boldsymbol\mu_1, \Sigma_1,\; \dots,\; \alpha_C, \boldsymbol\mu_C, \Sigma_C\}. \tag{4.19}
$$
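A compact NumPy sketch of (4.17)–(4.19) might look as follows; the toy dimensionality, component count and parameter values are arbitrary placeholders standing in for parameters that would normally be estimated from data:

```python
import numpy as np

def complex_gaussian_pdf(g, mu, Sigma):
    """Complex Gaussian density (4.17):
    N(g; mu, Sigma) = exp(-(g - mu)^H Sigma^{-1} (g - mu)) / (pi^D |Sigma|)."""
    D = g.shape[0]
    diff = g - mu
    exponent = -np.real(diff.conj() @ np.linalg.solve(Sigma, diff))
    return float(np.exp(exponent) / (np.pi ** D * np.abs(np.linalg.det(Sigma))))

def gmm_pdf(g, theta):
    """GMM density (4.18); theta is the parameter list (4.19), given here as
    a list of (alpha_c, mu_c, Sigma_c) triples."""
    return sum(alpha * complex_gaussian_pdf(g, mu, Sigma)
               for alpha, mu, Sigma in theta)

# Toy usage: D = 2 complex dimensions, C = 2 components (placeholder values).
D = 2
theta = [(0.6, np.zeros(D, dtype=complex), np.eye(D, dtype=complex)),
         (0.4, np.ones(D, dtype=complex), 2.0 * np.eye(D, dtype=complex))]
g = np.array([0.1 + 0.2j, -0.3 + 0.1j])
print(gmm_pdf(g, theta))
```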

The main question is how the parameters in (4.19) can be estimated from the given training data. The most popular estimation method is the expectation-maximization (EM) algorithm, but EM requires the number of Gaussians, C, as an input parameter. This number is often unknown, which is a strong motivation to apply unsupervised methods, such as the Figueiredo–Jain (FJ) algorithm, that estimate C along with the other parameters.

The probability density values, that is, the likelihoods, can be used directly to find the best facial feature candidates or to rank them [66]. It is even possible to reduce the search space considerably by discarding image features beyond a requested score level, that is, a density quantile [66]. In Fig. 4.13, the use of the density quantile for reducing the search space is demonstrated; it is clear that the spatial area corresponding to the 0.05 density quantile (0.95 confidence) contains the correct image feature.
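One simple way to approximate this pruning in code is to keep only the locations whose pdf values fall in the top fraction of the score distribution; this is a loose interpretation for illustration, not the exact density-quantile definition used in [66]:

```python
import numpy as np

# pdf_map: class-conditional pdf values evaluated at every pixel of an image
# (a hypothetical array; in practice the GMM would be evaluated per location).
pdf_map = np.random.rand(576, 720)

# Keep only the locations loosely corresponding to the 0.05 density quantile,
# here taken as the 5% of pixels with the highest density values:
threshold = np.quantile(pdf_map, 0.95)
candidate_mask = pdf_map >= threshold

# The remaining candidates can then be ranked directly by their likelihoods.
candidates = np.argwhere(candidate_mask)
```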


Fig. 4.13 Example of using density quantile of pdf values: a Pdf surface for the left nostril class; b Pdf values belonging to 0.5 density quantile; c Pdf values belonging to 0.05 density quantile [37]

Algorithm 4.1: Train facial feature classifier


Detecting Facial Features

A supervised learning algorithm for extracting simple Gabor features (multiresolution Gabor features) and estimating the class-conditional pdfs of the facial features is presented in Algorithm 4.1. Matlab functionality for efficient computation of the multiresolution Gabor features [76] and for the Gaussian mixture models and the FJ algorithm [26] is publicly available. In Algorithm 4.2, the main steps for extracting the features from an image are shown.
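Since the algorithm boxes themselves are not reproduced here, the following Python sketch only outlines the kind of training and extraction loops that Algorithms 4.1 and 4.2 describe. The feature extractor and the single-Gaussian fit are deliberately simplified stand-ins: a real implementation would use the multiresolution Gabor vector (4.16) and an unsupervised GMM estimator such as the FJ algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_feature(image, x, y):
    """Placeholder for the multiresolution Gabor feature vector (4.16); here it is
    simply the 3x3 neighborhood of (x, y) as a complex vector."""
    padded = np.pad(image, 1)
    return padded[y:y + 3, x:x + 3].astype(complex).ravel()

def fit_single_gaussian(samples):
    """Placeholder for GMM estimation (e.g. the FJ algorithm); fits one component."""
    X = np.array(samples)
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
    return mu, Sigma

def log_likelihood(g, mu, Sigma):
    d = g - mu
    return float(-np.real(d.conj() @ np.linalg.solve(Sigma, d))
                 - np.log(np.abs(np.linalg.det(Sigma))))

# Algorithm 4.1 (sketch): learn one class-conditional model per landmark class.
train_images = [rng.random((64, 64)) for _ in range(20)]
annotations = {'left_nostril': [(30, 40)] * 20}     # toy ground-truth landmarks
models = {cls: fit_single_gaussian([extract_feature(img, x, y)
                                    for img, (x, y) in zip(train_images, points)])
          for cls, points in annotations.items()}

# Algorithm 4.2 (sketch): extract the K highest-likelihood locations per class.
K, test_image = 5, rng.random((64, 64))
for cls, (mu, Sigma) in models.items():
    scores = np.array([[log_likelihood(extract_feature(test_image, x, y), mu, Sigma)
                        for x in range(64)] for y in range(64)])
    top = np.argsort(scores, axis=None)[::-1][:K]
    rows, cols = np.unravel_index(top, scores.shape)
    print(cls, list(zip(cols, rows)))               # K best (x, y) candidates
```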

Experiments Using the XM2VTS Face Database

The XM2VTS facial image database is a publicly available database for benchmarking face detection and recognition methods [58]. The frontal part of the database contains 600 training images and 560 test images of size 720 × 576 (width × height) pixels. For facial images, ten specific regions (see Fig. 4.12(a)) have been shown to have favorable properties as keypoints [32]. A distance normalized so that the distance between the eyes equals 1.0 is used as the measure of image feature detection accuracy; the measure is demonstrated in Fig. 4.14(a).
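This normalized accuracy measure is straightforward to compute; a small sketch with hypothetical pixel coordinates:

```python
import numpy as np

def normalized_error(detected, ground_truth, left_eye, right_eye):
    """Landmark localization error normalized so that the distance between
    the eye centers equals 1.0."""
    eye_distance = np.linalg.norm(np.subtract(right_eye, left_eye))
    return np.linalg.norm(np.subtract(detected, ground_truth)) / eye_distance

# Hypothetical coordinates: an error of 0.05 means 5% of the inter-eye distance.
print(normalized_error((310, 205), (305, 200), (250, 200), (370, 200)))  # ~0.059
```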

Algorithm 4.2: Extract K best face features of each class from an image


Gabor parameters were experimentally selected using a cross-validation procedure over the training and evaluation sets of the database; among the selected values, the highest frequency was f_high = 1/40. Image features were extracted in ranked order, and a keypoint was considered correctly extracted if it was within a pre-set distance limit from the correct location. Results for XM2VTS are presented in Fig. 4.14(b). The distances are scale normalized so that the distance between the centers of the eyes is 1.0 (see Fig. 4.14(a) for a demonstration). On average, 4 correct image features were included in the first 10 extracted features within the distance limit 0.05; when the number of extracted features was increased to 100, over 9 correct features were found within 0.05 and almost all within 0.10 and 0.20. It should be noted that accuracies of 0.10 and 0.20 are still very good for face registration and recognition. Increasing the number of image features beyond 100 (10 per class) did not improve the results further, but by relaxing the distance limit to 0.10, almost perfect results were reached with only the first 10 image features from each class. Typical detection results are shown in Figs. 4.14(c)-(e).

Methods for accurate face and facial feature detection and localization based on the described Gabor representations have been proposed and reported to produce state-of-the-art detection accuracy for more difficult and realistic data sets (XM2VTS/non-frontal, BANCA and BioID) [32, 40].

Discussions on Local Features

A drawback of the LBP method, as well as of all local descriptors that apply vector quantization, is that they are not robust in the sense that a small change in the input image may cause a large change in the output. LBP may not work properly for noisy images or on flat image areas of constant gray level. Many variants of LBP have been proposed to improve its robustness. For instance, Tan and Triggs proposed a three-level operator called local ternary patterns to deal with, for example, problems on flat image areas [80]. Liao et al. [52] introduced dominant local binary patterns, which use the most frequently occurring LBP patterns to improve recognition accuracy compared to the original uniform patterns. Raja and Gong proposed sparse multiscale local binary patterns to better exploit the discriminative capacity of the available multiscale features [69]. Inspired by LBP, higher-order local derivative patterns (LDP) were proposed by Zhang et al., with applications in face recognition [98].
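To make the robustness issue concrete, a minimal LBP sketch (illustrative only, not a production operator) shows how a tiny perturbation on a flat area can change the code arbitrarily:

```python
import numpy as np

def lbp_code(patch):
    """Basic 8-neighbour LBP code for the center pixel of a 3x3 patch."""
    center = patch[1, 1]
    # neighbours listed in a fixed circular order
    order = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    bits = [int(patch[i, j] >= center) for i, j in order]
    return sum(b << k for k, b in enumerate(bits))

flat = np.full((3, 3), 100.0)                       # flat area of constant gray level
noisy = flat + np.random.default_rng(1).normal(0.0, 0.5, (3, 3))
print(lbp_code(flat), lbp_code(noisy))              # 255 vs. an essentially random code
```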


Fig. 4.14 a Demonstration of accuracy distance measure; b Performance for facial feature detection in XM2VTS test images; c, d, e Examples of extracted features (left eye center: blue, right eye outer corner: green, left nostril: red, right mouth corner: cyan, 5 best feature for each landmark numbered from 1 to 5) [37] 

LBP has also inspired the development of new effective local face descriptors, such as the Weber Law Descriptor (WLD), which contains differential excitation and orientation components [13], and the blur-invariant Local Phase Quantization (LPQ) descriptor [65]. The LPQ descriptor has received wide interest in blur-invariant face recognition [5]. LPQ is based on quantizing the Fourier transform phase in local neighborhoods. Similarly to the widely used LBP-based face description, histograms of LPQ labels computed within local regions are adopted as a face descriptor. Experiments have shown that such LPQ descriptors are highly discriminative and produce very promising face recognition results, outperforming LBP both with blurred and sharp images on the CMU PIE and FRGC 1.0.4 datasets.
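A simplified sketch of the LPQ idea follows; the decorrelation step of the full descriptor is omitted, and the window size and frequency convention are assumptions, so this is an illustration of the principle rather than the published operator:

```python
import numpy as np
from scipy.signal import convolve2d

def lpq_labels(image, win=7):
    """Simplified LPQ labels: signs of the real and imaginary parts of four local
    Fourier coefficients, packed into an 8-bit code per pixel."""
    a = 1.0 / win
    n = np.arange(-(win // 2), win // 2 + 1)
    w0 = np.ones_like(n, dtype=complex)
    w1 = np.exp(-2j * np.pi * a * n)
    # separable kernels for the low frequencies (a, 0), (0, a), (a, a), (a, -a)
    kernels = [np.outer(w0, w1), np.outer(w1, w0),
               np.outer(w1, w1), np.outer(w1, np.conj(w1))]
    labels = np.zeros(image.shape, dtype=int)
    for b, k in enumerate(kernels):
        F = convolve2d(image, k, mode='same')
        labels |= (np.real(F) >= 0).astype(int) << (2 * b)
        labels |= (np.imag(F) >= 0).astype(int) << (2 * b + 1)
    return labels

# The face descriptor is then a histogram of the labels within a (local) region.
image = np.random.rand(64, 64)
hist, _ = np.histogram(lpq_labels(image), bins=256, range=(0, 256))
```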

A current trend in the development of new effective local face image descriptors is to combine the strengths of complementary descriptors. From the beginning, the LBP operator was designed as a complementary measure of local image contrast. Applying LBP to Gabor-filtered face images, or using LBP and Gabor methods jointly, has provided excellent results in face recognition [81, 96]. For instance, Zhang et al. [96] proposed extracting LBP features from images obtained by filtering a facial image with 40 Gabor filters of different scales and orientations, and obtained excellent results on all FERET sets. A downside of the method lies in the high dimensionality of the feature vector (LBP histogram), which is calculated from the 40 Gabor images derived from each single original image. To overcome this problem of large feature dimension, Shan et al. [73] presented an extension using Fisher Discriminant Analysis (FDA) instead of the χ2 (chi-square) and histogram-intersection measures previously used in [96]. The authors constructed an ensemble of piecewise FDA classifiers, each of which is built on one segment of the high-dimensional LBP histograms; impressive results were reported on the FERET database. Other works have also successfully exploited the complementarity of Gabor filters and LBP features by fusing the two sets of features, for example for age classification [86]. Combining ideas from Haar and LBP features has also given excellent results in accurate and illumination-invariant face detection [71, 91].

Features based on Gabor filters are very versatile. By post-processing, they can be transformed, for example, into binary texture descriptors similar to LBPs. For example, in Daugman's iris code the response phase is quantized to two bits (four quadrants in the complex plane) [18]. Daugman's descriptor is very discriminative, and its histograms were used in face recognition in [97]. Utilizing the phase information is important for discrimination, and many other efficient post-processing methods exist in the literature, especially in recognition methods oriented toward the human visual system [72]. Another important property of Gabor filters is that the original signal can be reconstructed. This property was employed in this chapter, where we introduced an efficient facial feature descriptor based on Gabor features at a single location. Recently, the importance of phase information has been recognized, and very good recognition results have been reported for features based on Gabor phase [96]. It is important to note that the complex-valued response, including both magnitude and phase, is the most natural representation and should be used in methods based on Gabor filters.
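The two-bit phase-quadrant coding mentioned above can be written in a few NumPy lines (the response values are placeholders):

```python
import numpy as np

# Complex Gabor responses (placeholder values). Quadrant coding keeps only the
# quadrant of the phase, i.e. two bits per response: the sign of the real part
# and the sign of the imaginary part.
responses = np.array([0.3 + 0.1j, -0.2 + 0.4j, -0.1 - 0.5j, 0.6 - 0.2j])
code = np.stack([(np.real(responses) >= 0).astype(int),
                 (np.imag(responses) >= 0).astype(int)], axis=1)
print(code)   # one row of two bits per complex response
```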

Conclusions

Finding efficient facial or facial feature representations is a key issue in developing robust face recognition systems. Many methods have been proposed for this purpose. Local feature based methods appear to be more robust against variations in pose or illumination than holistic methods. In particular, methods based on Gabor filter responses and local binary patterns have been highly successful in face image processing.
