Introduction to Face Recognition Part 2

Solution Strategies

There are two strategies for tackling the challenges outlined in Sect. 1.5: (i) extract invariant and discriminative face features, and (ii) construct a robust face classifier. A set of features, constituting a feature space, is deemed to be good if the face manifolds in that space are simple (i.e., less nonlinear and nonconvex). Obtaining such features requires two stages of processing: (1) normalizing face images geometrically and photometrically (for example, by geometric warping into a standard frame and photometric illumination correction) and (2) extracting features from the normalized images, for example with Gabor wavelets or local binary patterns (LBP), that are stable with respect to possible geometric and photometric variations.
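As a rough illustration of what "stable with respect to photometric variations" means, the sketch below computes the basic 3 x 3 LBP code and shows that it does not change under a monotonic brightness change. It is a minimal pure-Python version of the basic operator, not necessarily the variant used in the cited work; the function names and the tiny test patch are made up for the example.

```python
def lbp_code(img, r, c):
    """8-neighbour LBP code of pixel (r, c) in a 2D list of gray values."""
    center = img[r][c]
    # Neighbours visited clockwise, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        # Set the bit when the neighbour is at least as bright as the center.
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(img):
    """256-bin histogram of LBP codes over the interior pixels of img."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    return hist


# Hypothetical 3 x 3 patch and a brightened copy (monotonic gray-level change).
patch = [[10, 20, 30],
         [40, 50, 60],
         [70, 80, 90]]
brighter = [[2 * v + 5 for v in row] for row in patch]

same = lbp_histogram(patch) == lbp_histogram(brighter)
```

Because each bit depends only on the ordering of a neighbour and the center pixel, any gray-level change that preserves that ordering leaves the histogram untouched. Geometric changes are not handled by this operator, which is why the normalization stage comes first.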

A powerful classification engine is still necessary to deal with difficult nonlinear classification and regression problems in the constructed feature space. This is because the normalization and feature extraction cannot solve the problems of nonlinearity and nonconvexity. Learning methods are useful tools to find good features and build powerful robust classifiers based on these features. The two stages of processing may be designed jointly using learning methods.

In the early development of face recognition [6, 13, 18, 36], geometric facial features such as eyes, nose, mouth, and chin were explicitly used. Properties of the features and relations (e.g., areas, distances, angles) between the features were used as descriptors for face recognition. Advantages of this approach include economy and efficiency when achieving data reduction and insensitivity to variations in illumination and viewpoint. However, facial feature detection and measurement techniques developed to date are not sufficiently reliable for the geometric feature-based recognition [9]. Further, geometric properties alone are inadequate for face recognition because rich information contained in the facial texture or appearance is not utilized. These are the main reasons why early feature-based techniques were not effective.


Statistical learning methods are the mainstream approach used in building current face recognition systems. Effective features and classifiers are learned from training data (appearance images or features extracted therefrom). During learning, both prior knowledge about faces and the variations encountered in the training data are taken into consideration. The appearance-based approach, such as PCA [42] and LDA [3] based methods, has significantly advanced face recognition technology. Such an approach generally operates directly on an image-based representation (i.e., an array of pixel intensities). It extracts features in a subspace derived from training images. Using PCA, an “optimal” face subspace is constructed to represent only the face object; using LDA, a discriminant subspace is constructed to distinguish faces of different persons. It is now well known that LDA-based methods generally yield better results than PCA-based methods [3].
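The subspace idea can be sketched in a few lines. The example below, which assumes NumPy is available, derives a PCA subspace from a set of vectorized images and projects them into it. Random data stands in for aligned face images, and the names pca_subspace and project are illustrative only; an LDA-based method would replace the variance criterion with the ratio of between-person to within-person scatter.

```python
import numpy as np


def pca_subspace(X, k):
    """Return the mean and the top-k principal axes of the row vectors in X."""
    mean = X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions,
    # ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]


def project(X, mean, axes):
    """Coordinates of the rows of X in the subspace spanned by `axes`."""
    return (X - mean) @ axes.T


# Synthetic stand-in for 20 aligned 8 x 8 face images, one per row.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 64))

mean, axes = pca_subspace(X, 5)
Y = project(X, mean, axes)   # 20 feature vectors of length 5
```

Each image is reduced from 64 pixel values to 5 subspace coefficients, and a classifier then operates on those coefficients instead of the raw pixels.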

These linear, holistic appearance-based methods encode prior knowledge contained in the training data and avoid the instability of manual selection and tuning needed in the early geometric feature-based methods. However, they are not effective in describing local variations in the face appearance and are unable to capture subtleties of face subspaces: protrusions of nonconvex manifolds may be smoothed out and concavities may be filled in, thereby losing useful information. Note that the appearance-based methods require that the face images be properly aligned, typically based on the eye locations.

Nonlinear subspace methods use nonlinear transforms to convert a face image into a feature vector in a discriminative feature space. Kernel PCA [37] and kernel LDA [29] use the kernel trick to map the original data into a high-dimensional space in which the data become separable. Manifold learning, which assumes that face images occupy a low-dimensional manifold in the original space, attempts to model such manifolds. Methods of this kind include ISOMAP [39], LLE [35], and LPP [15]. Although these methods achieve good performance on the training data, they tend to overfit and hence do not generalize well to unseen data.

The most successful approach to date for handling the nonconvex face distribution works with local appearance-based features extracted using appropriate image filters. This is advantageous in that distributions of face images in local feature space are less affected by changes in facial appearance. Early work in this direction included local feature analysis (LFA) [33] and Gabor wavelet-based features [21, 45]. Current methods are based on local binary pattern (LBP) [1] and many variants (for example ordinal feature [23], Scale-Invariant Feature Transform (SIFT) [26], and Histogram of Oriented Gradients (HOG) [10]). While these features are general-purpose and can be extracted from arbitrary images, face-specific local filters may be learned from images [7, 20].

A large number of local features can be generated by varying parameters associated with the position, scale, and orientation of the filters. For example, about 400,000 local appearance features (100 x 100 positions x 5 scales x 8 orientations) are generated when an image of size 100 x 100 is filtered with Gabor filters at five different scales and eight different orientations for all pixel positions. While some of these features are useful for face recognition, others may be less useful or may even degrade the recognition performance. Boosting-based methods have been implemented to select good local features [46, 48, 49]. A discriminant analysis step can be applied to further transform the space of the selected local features into a discriminative subspace of lower dimensionality to achieve better face classification [22, 24, 25]. This leads to a framework for learning both effective features and powerful classifiers.
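The feature count quoted above follows from simple multiplication: one filter response per pixel position, per scale, per orientation. The sketch below builds a small Gabor filter bank in pure Python and computes that count. The kernel size, wavelengths, and Gaussian width are assumptions chosen for illustration, not values taken from the cited work.

```python
import cmath
import math


def gabor_kernel(size, wavelength, theta, sigma):
    """Complex Gabor kernel as a size x size nested list (size must be odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Coordinate along the filter orientation theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            # Isotropic Gaussian envelope times a complex sinusoidal carrier.
            envelope = math.exp(-(x * x + y * y) / (2 * sigma ** 2))
            carrier = cmath.exp(1j * 2 * math.pi * xr / wavelength)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


# Hypothetical parameters: 5 wavelengths (scales) and 8 orientations.
scales = [4 * (2 ** (s / 2)) for s in range(5)]
orientations = [k * math.pi / 8 for k in range(8)]
bank = [gabor_kernel(9, w, t, sigma=w / 2)
        for w in scales for t in orientations]

# One response per pixel position and per filter in the bank.
height, width = 100, 100
n_features = height * width * len(bank)   # 100 * 100 * 40
```

With 40 filters and 10,000 pixel positions the bank yields 400,000 responses. The count doubles if the real and imaginary parts are kept as separate features, which is one reason a selection step such as boosting is applied afterwards.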

There have been only a few studies reported on face recognition at a distance. These approaches can be essentially categorized into two groups: (i) generating a super-resolution face image from the given low-resolution image [11, 32] and (ii) acquiring a high-resolution face image using a special camera system (e.g., a high-resolution camera or a PTZ camera) [4, 14, 28, 40, 47].

The availability of high resolution face images (i.e., tens of megapixels per image) provides new opportunities in facial feature representation and matching. In the 2006 Face Recognition Vendor Test (FRVT) [31], the best face matching accuracies were obtained from the high resolution 2D images or 3D images. This underlines the importance of developing advanced sensors as well as robust feature extraction and matching algorithms in achieving high face recognition accuracy. The increasing popularity of infrared cameras also supports the importance of sensing techniques.

Current Status

For cooperative scenarios, frontal face detection and tracking in a normal lighting environment is a reasonably well-solved problem. Assuming the face is captured with sufficient image resolution, 1:1 face verification also works satisfactorily for cooperative frontal faces. Figure 1.8 illustrates an application of face verification at the 2008 Beijing Olympic Games. This system verifies the identity of a ticket holder (spectator) at entrances to the National Stadium (Bird’s Nest). Each ticket is associated with a unique ID number, and the ticket holder is required to submit a registration form with a two-inch ID/passport photo attached. The face photo is scanned into the system. At the entrance, the ticket is read by an RFID reader and the face image is captured by a video camera. The captured image is compared with the scanned enrollment photo, and the verification result is produced.

A novel solution to uncontrolled illumination is to use active near infrared (NIR) face imaging to control the direction and strength of the illumination. This enables the system to achieve high face recognition accuracy. NIR face recognition technology has been in use at the China-Hong Kong border for self-service immigration clearance since 2005 (see Fig. 1.8).

Fig. 1.8 1:1 face verification used at the 2008 Beijing Olympic Games, and 1:1 NIR face verification used at the China-Hong Kong border control since 2005

Fig. 1.9 An embedded NIR face recognition system for access control in 1:N identification mode and watch-list face surveillance and identification at subways

One-to-many face identification using the conventional, visible band face images has not yet met the accuracy requirements of practical applications even for cooperative scenarios. The main problem is the uncontrolled ambient illumination. The NIR face recognition provides a good solution, even for 1:N identification. Embedded NIR face recognition based access control products (Fig. 1.9) have been on the market since 2008.
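The difference between the 1:1 and 1:N modes discussed here can be made concrete with a small sketch. Verification compares a probe against a single enrolled template, while identification searches a whole gallery and must also decide whether the best match is good enough. Cosine similarity, the threshold value, and the toy three-dimensional templates below are assumptions made for illustration, not details of the deployed systems.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def verify(probe, enrolled, threshold):
    """1:1 verification: does the probe match the claimed identity?"""
    return cosine(probe, enrolled) >= threshold


def identify(probe, gallery, threshold):
    """1:N identification: best-matching identity, or None if no match."""
    best_id, best_score = max(
        ((name, cosine(probe, template)) for name, template in gallery.items()),
        key=lambda pair: pair[1],
    )
    return best_id if best_score >= threshold else None


# Hypothetical gallery of enrolled templates.
gallery = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
probe = [0.9, 0.1, 0.0]

claimed_ok = verify(probe, gallery["alice"], threshold=0.8)
who = identify(probe, gallery, threshold=0.8)
stranger = identify([0.0, 0.0, 1.0], gallery, threshold=0.8)
```

In the 1:N case every additional gallery entry is another chance for a false match, which is one reason identification is harder than verification at the same image quality.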

Face recognition in noncooperative scenarios, such as watch-list identification, remains a challenging task. Major problems include pose, illumination, and motion blur. Because of the growing emphasis on security, there have been several watch-list identification application trials. The right side of Fig. 1.9 shows a snapshot of 1:N watch-list face surveillance and identification at a Beijing Municipal Subways station, aimed at identifying suspects in the crowd. CCTV cameras are mounted at the subway entrances and exits in such a way that images of frontal faces are more likely to be captured. The best system could achieve a recognition rate of up to 60% at a false accept rate (FAR) of 0.1%.
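A figure such as "60% at FAR = 0.1%" is obtained by fixing the decision threshold so that only the allowed fraction of impostor comparisons is accepted, then measuring how many genuine comparisons pass that threshold. The sketch below shows the computation on made-up scores. The function names and data are hypothetical, and a meaningful estimate at a FAR of 0.1% would need thousands of impostor comparisons rather than the handful used here.

```python
def threshold_at_far(impostor_scores, far):
    """Threshold such that the fraction of impostor scores strictly above it
    does not exceed `far`."""
    ranked = sorted(impostor_scores, reverse=True)
    allowed = int(far * len(ranked))   # number of impostors we may accept
    if allowed >= len(ranked):
        return float("-inf")           # every impostor may be accepted
    return ranked[allowed]


def rate_at_far(genuine_scores, impostor_scores, far):
    """Fraction of genuine scores accepted at the threshold for `far`."""
    t = threshold_at_far(impostor_scores, far)
    accepted = sum(1 for s in genuine_scores if s > t)
    return accepted / len(genuine_scores)


# Toy similarity scores (higher means more similar); a FAR of 25% is used
# only because the sample is tiny.
impostors = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
genuine = [0.55, 0.65, 0.9, 0.95, 0.3]

t = threshold_at_far(impostors, far=0.25)
rate = rate_at_far(genuine, impostors, far=0.25)
```

Lowering the permitted FAR raises the threshold and therefore lowers the recognition rate, so the two numbers are only meaningful when reported together.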

Summary

Face recognition technology has made impressive gains, but it is still not able to meet the accuracy requirements of many applications. A sustained and collaborative effort is needed to address many of the open problems in face recognition.
