Empirical Comparison of Subspace Methods
Moghaddam [23] reported on an extensive evaluation of many of the subspace methods described above on a large subset of the FERET data set [31] (see also Chap. 13).
Fig. 2.10 Experiments on FERET data. a Several faces from the gallery. b Multiple probes for one individual, with different facial expressions, eyeglasses, variable ambient lighting, and image contrast. c Eigenfaces. d ICA basis images
The experimental data consisted of a training “gallery” of 706 individual FERET faces and 1123 “probe” images containing one or more views of every person in the gallery. All these images were aligned and normalized as described by Moghaddam and Pentland [25]. The multiple probe images reflected various expressions, lighting, glasses on/off, and so on. The study compared the Bayesian approach described in Sect. 2.3.4 to a number of other techniques and tested the limits of the recognition algorithms with respect to image resolution or equivalently the amount of visible facial detail. Because the Bayesian algorithm was independently evaluated in DARPA’s 1996 FERET face recognition competition [31] with medium resolution images (84 x 44 pixels)—achieving an accuracy of «95% on 0(103) individuals— it was decided to lower the resolution (the number of pixels) by a factor 16. Therefore, the aligned faces in the data set were downsampled to 21 x 12 pixels, yielding input vectors in a RN=252 space. Several examples are shown in Fig. 2.10a, b.
The reported results were obtained with a fivefold CrossValidation (CV) analysis. The total data set of 1829 faces (706 unique individuals and their collective 1123 probes) was randomly partitioned into five subsets with unique (nonoverlapping) individuals and their associated probes. Each subset contained both gallery and probe images of «140 unique individuals. For each of the five subsets, the recognition task was correctly matching the multiple probes to the «140 gallery faces using the other four subsets as training data. Note that with N = 252 and using 80% of the entire dataset for training, there are nearly three times as many training samples than the data dimensionality; thus, parameter estimations (for PCA, ICA, KPCA, and the Bayesian method) were properly overconstrained.
The resulting five experimental trials were pooled to compute the mean and standard deviation of the recognition rates for each method. The fact that the training and testing sets had no overlap in terms of individual identities led to an evaluation of the algorithms’ generalization performance—the ability to recognize new individuals who were not part of the manifold computation or density modeling with the training set.
The baseline recognition experiments used a default manifold dimensionality of k = 20. This choice of k was made for two reasons: It led to a reasonable PCA reconstruction error of MSE = 0.0012 (or 0.12% per pixel with a normalized intensity range of [0,1]) and a baseline PCA recognition rate of «80% (on a different 50/50 partition of the dataset), thereby leaving a sizable margin for improvement. Note that because the recognition experiments were essentially a 140way classification task, chance performance was approximately 0.7%.
PCABased Recognition
The baseline algorithm for these face recognition experiments was standard PCA (eigenface) matching. The first eight principal eigenvectors computed from a single partition are shown in Fig. 2.10c. Projection of the test set probes onto the 20dimensional linear manifold (computed with PCA on the training set only) followed by nearestneighbor matching to the «140 gallery images using a Euclidean metric yielded a mean recognition rate of 77.31%, with the highest rate achieved being 79.62% (Table 2.1). The full imagevector nearestneighbor (template matching) (i.e., on x e R252) yielded a recognition rate of 86.46% (see dashed line in Fig. 2.11). Clearly, performance is degraded by the 252 ^ 20 dimensionality reduction, as expected.
ICABased Recognition
For ICAbased recognition (Architecture II, see Sect. 2.3.5) two algorithms based on fourthorder cumulants were tried: the “JADE” algorithm of Cardoso [5] and the fixedpoint algorithm of Hyvarinen and Oja [15]. In both algorithms a PCA whitening step (“sphering”) preceded the core ICA decomposition. The corresponding nonorthogonal JADEderived ICA basis is shown in Fig. 2.10d. Similar basis faces were obtained with the method of Hyvarinen and Oja. These basis faces are the columns of the matrix A in (2.14), and their linear combination (specified by the ICs) reconstructs the training data. The ICA manifold projection of the test set was obtained using y = A1 x. Nearestneighbor matching with ICA using the Euclidean L2 norm resulted in a mean recognition rate of 77.30% with the highest rate being 82.90% (Table 2.1). We found little difference between the two ICA algorithms and noted that ICA resulted in the largest performance variation in the five trials (7.66% SD). Based on the mean recognition rates it is unclear whether ICA provides a systematic advantage over PCA or whether “more nonGaussian” and/or “more independent” components result in a better manifold for recognition purposes with this dataset.
Note that the experimental results of Bartlett et al. [1] with FERET faces did favor ICA over PCA. This seeming disagreement can be reconciled if one considers the differences in the experimental setup and in the choice of the similarity measure. First, the advantage of ICA was seen primarily with more difficult timeseparated images. In addition, compared to the results of Bartlett et al. [1] the faces in this experiment were cropped much tighter, leaving no information regarding hair and face shape, and they were much lower in resolution, factors that when combined make the recognition task much more difficult.
Partition 
PCA 
ICA 
KPCA 
Bayes 
1 
78.00 
82.90 
83.26 
95.46 
2 
79.62 
77.29 
92.37 
97.87 
3 
78.59 
79.19 
88.52 
94.49 
4 
76.39 
82.84 
85.96 
92.90 
5 
73.96 
64.29 
86.57 
93.45 
Mean 
77.31 
77.30 
87.34 
94.83 
SD 
2.21 
7.66 
3.39 
1.96 
Table 2.1 Recognition accuracies with k = 20 subspace projections using fivefold cross validation.
The second factor is the choice of the distance function used to measure similarity in the subspace. This matter was further investigated by Draper et al. [10]. They found that the best results for ICA are obtained using the cosine distance, whereas for eigenfaces the L1 metric appears to be optimal; with L2 metric, which was also used in the experiments of Moghaddam [23], the performance of ICA (Architecture II) was similar to that of eigenfaces.
KPCABased Recognition
For KPCA, the parameters of Gaussian, polynomial, and sigmoidal kernels were first finetuned for best performance with a different 50/50 partition validation set, and Gaussian kernels were found to be the best for this data set. For each trial, the kernel matrix was computed from the corresponding training data. Both the test set gallery and probes were projected onto the kernel eigenvector basis (2.25) to obtain the nonlinear principal components which were then used in nearestneighbor matching of test set probes against the test set gallery images. The mean recognition rate was found to be 87.34%, with the highest rate being 92.37% (Table 2.1). The standard deviation of the KPCA trials was slightly higher (3.39) than that of PCA (2.21), but Fig. 2.11 indicates that KPCA does in fact do better than both PCA and ICA, hence justifying the use of nonlinear feature extraction.
MAPBased Recognition
For Bayesian similarity matching, appropriate training Δs for the two classes Ωι (Fig. 2.10b) and ΩΕ (Fig. 2.10a) were used for the dual PCAbased density estimates Ρ(Δ  Ωj) and Ρ(Δ  ΩΕ), which were both modeled as single Gaussians with subspace dimensions of kI and kE, respectively.
Fig. 2.11 Recognition performance of PCA, ICA, and KPCA manifolds versus Bayesian (MAP) similarity matching with a k = 20 dimensional subspace. Dashed line indicates the performance of nearestneighbor matching with the fulldimensional image vectors
The total subspace dimensionality k was divided evenly between the two densities by setting for modeling.8
With k = 20, Gaussian subspace dimensions of kI = 10 and kE = 10 were used forrespectively. Note thatthus matching the total number of projections used with the three principal manifold techniques. Using the maximum a posteriori (MAP) similarity in (2.9), the Bayesian matching technique yielded a mean recognition rate of 94.83%, with the highest rate achieved being 97.87% (Table 2.1). The standard deviation of the five partitions for this algorithm was also the lowest (1.96) (Fig 2.11).
Compactness of Manifolds
The performance of various methods with different size manifolds can be compared by plotting their recognition rates R(k) as a function of the first k principal components. For the manifold matching techniques, this simply means using a subspace dimension of k (the first k components of PCA/ICA/KPCA), whereas for the Bayesian matching technique this means that the subspace Gaussian dimensions should satisfy kI + kE = k. Thus all methods used the same number of subspace projections. This test was the premise for one of the key points investigated by Moghaddam [23]: Given the same number of subspace projections, which of these techniques is better at data modeling and subsequent recognition? The presumption is that the one achieving the highest recognition rate with the smallest dimension is preferred.
Fig. 2.12 Recognition accuracy R(k) of PCA, KPCA, and Bayesian similarity with increasing dimensionality k of the principal subspace. ICA results, not shown, are similar to those of PCA
For this particular dimensionality test, the total data set of 1829 images was partitioned (split) in half: a training set of 353 gallery images (randomly selected) along with their corresponding 594 probes and a testing set containing the remaining 353 gallery images and their corresponding 529 probes. The training and test sets had no overlap in terms of individuals’ identities. As in the previous experiments, the test set probes were matched to the test set gallery images based on the projections (or densities) computed with the training set. The results of this experiment are shown in Fig. 2.12, which plots the recognition rates as a function of the dimensionality of the subspace k. This is a more revealing comparison of the relative performance of the methods, as compactness of the manifolds—defined by the lowest acceptable value of k—is an important consideration in regard to both generalization error (overfitting) and computational requirements.
Discussion
The relative performance of the principal manifold techniques and Bayesian matching is summarized in Table 2.1 and Fig. 2.11. The advantage of probabilistic matching over metric matching on both linear and nonlinear manifolds is quite evident («18% increase over PCA and «8% over KPCA). Note that the dimensionality test results in Fig. 2.12 indicate that KPCA outperforms PCA by a «10% margin, and even more so with only few principal components (a similar effect was reported by Scholkopf et al. [32] where KPCA outperforms PCA in lowdimensional manifolds). However, Bayesian matching achieves «90% with only four projections— two for each Ρ(Δ  Ω)—and dominates both PCA and KPCA throughout the entire range of subspace dimensions in Fig. 2.12.
A comparison of the subspace techniques with respect to multiple criteria is shown in Table 2.2. Note that PCA, KPCA, and the dual subspace density estimation are uniquely defined for a given training set (making experimental comparisons repeatable), whereas ICA is not unique owing to the variety of techniques used to compute the basis and the iterative (stochastic) optimizations involved. Considering the relative computation (of training), KPCA required «7 x 109 floatingpoint operations compared to PCA’s «2 x 108 operations. On the average, ICA computation was one order of magnitude larger than that of PCA. Because the Bayesian similarity method’s learning stage involves two separate PCAs, its computation is merely twice that of PCA (the same order of magnitude).

PCA 
ICA 
KPCA 
Bayes 
Accuracy 
77% 
77% 
87% 
95% 
Computation 
10^{8} 
10^{9} 
10^{9} 
10^{8} 
Uniqueness 
Yes 
No 
Yes 
Yes 
Projections 
Linear 
Linear 
Nonlinear 
Linear 
Table 2.2 Comparison of the subspace techniques across multiple attributes (k = 20)
Considering its significant performance advantage (at low subspace dimensionality) and its relative simplicity, the dualeigenface Bayesian matching method is a highly effective subspace modeling technique for face recognition. In independent FERET tests conducted by the U.S. Army Laboratory [31], the Bayesian similarity technique outperformed PCA and other subspace techniques, such as Fisher’s linear discriminant (by a margin of at least 10%). Experimental results described above show that a similar recognition accuracy can be achieved using mere “thumbnails” with 16 times fewer pixels than in the images used in the FERET test. These results demonstrate the Bayesian matching technique’s robustness with respect to image resolution, revealing the surprisingly small amount of facial detail required for high accuracy performance with this learning technique.
Methodology and Usage
In this section, we discuss issues that require special care from the practitioner, in particular, the approaches designed to handle database with varying imaging conditions. We also present a number of extensions and modifications of the subspace methods.
Multiple ViewBased Approach for Pose
The problem of face recognition under general viewing conditions (change in pose) can also be approached using an eigenspace formulation. There are essentially two ways to approach this problem using an eigenspace framework. Given M individuals under C different views, one can do recognition and pose estimation in a universal eigenspace computed from the combination of MC images. In this way, a single parametric eigenspace encodes identity as well as pose. Such an approach, for example, has been used by Murase and Nayar [28] for general 3D object recognition.
Fig. 2.13 Parametric versus viewbased eigenspace methods. a Reconstructions of the input image (left) with parametric (middle) and viewbased (right) eigenspaces. Top: training image; bottom: novel (test) image. b Difference in the way the two approaches span the manifold
Alternatively, given M individuals under C different views, we can build a viewbased set of C distinct eigenspaces, each capturing the variation of the M individuals in a common view. The viewbased eigenspace is essentially an extension of the eigenface technique to multiple sets of eigenvectors, one for each combination of scale and orientation. One can view this architecture as a set of parallel observers, each trying to explain the image data with their set of eigenvectors. In this viewbased, multipleobserver approach, the first step is to determine the location and orientation of the target object by selecting the eigenspace that best describes the input image. This can be accomplished by calculating the likelihood estimate using each viewspace’s eigenvectors and then selecting the maximum.
The key difference between the viewbased and parametric representations can be understood by considering the geometry of face subspace, illustrated in Fig. 2.13b. In the highdimensional vector space of an input image, multipleorientation training images are represented by a set of C distinct regions, each defined by the scatter of M individuals. Multiple views of a face form nonconvex (yet connected) regions in image space [3]. Therefore, the resulting ensemble is a highly complex and nonseparable manifold.
The parametric eigenspace attempts to describe this ensemble with a projection onto a single lowdimensional linear subspace (corresponding to the first k eigenvectors of the MC training images). In contrast, the viewbased approach corresponds to C independent subspaces, each describing a particular region of the face subspace (corresponding to a particular view of a face). The principal manifold vc of each region c is extracted separately. The relevant analogy here is that of modeling a complex distribution by a single cluster model or by the union of several component clusters. Naturally, the latter (viewbased) representation can yield a more accurate representation of the underlying geometry.
This difference in representation becomes evident when considering the quality of reconstructed images using the two methods. Figure 2.13 compares reconstructions obtained with the two methods when trained on images of faces at multiple orientations. In the top row of Fig. 2.13a, we see first an image in the training set, followed by reconstructions of this image using first the parametric eigenspace and then the viewbased eigenspace. Note that in the parametric reconstruction, neither the pose nor the identity of the individual is adequately captured. The viewbased reconstruction, on the other hand, provides a much better characterization of the object. Similarly, in the bottom row of Fig. 2.13a, we see a novel view (+68°) with respect to the training set (—90° to +45°). Here, both reconstructions correspond to the nearest view in the training set (+45°), but the viewbased reconstruction is seen to be more representative of the individual’s identity. Although the quality of the reconstruction is not a direct indicator of the recognition power, from an informationtheoretical pointofview, the multiple eigenspace representation is a more accurate representation of the signal content.
Fig. 2.14 Multiview face image data used in the experiments described in Sect. 2.6.1.
The viewbased approach was evaluated [25] on data similar to that shown in Fig. 2.14 which consisted of 189 images: nine views of 21 people. The viewpoints were evenly spaced from —90° to +90° along the horizontal plane. In the first series of experiments, the interpolation performance was tested by training on a subset of the available views (±90°, ±45°, 0°) and testing on the intermediate views (±68°, ±23°). A 90% average recognition rate was obtained. A second series of experiments tested the extrapolation performance by training on a range of views (e.g., —90° to +45°) and testing on novel views outside the training range (e.g., +68° and +90°). For testing views separated by ±23° from the training range, the average recognition rate was 83%. For ±45° testing views, the average recognition rate was 50%.
Fig. 2.15 Modular eigenspaces. a Rectangular patches whose appearance is modeled with eigenfeatures. b Performance of eigenfaces, eigenfeatures, and the layered combination of both as a function of subspace dimension.
Modular Recognition
The eigenface recognition method is easily extended to facial features [30], as shown in Fig. 2.15a. This leads to an improvement in recognition performance by incorporating an additional layer of description in terms of facial features. This can be viewed as either a modular or layered representation of a face, where a coarse (lowresolution) description of the whole head is augmented by additional (higher resolution) details in terms of salient facial features. Pentland et al. [30] called the latter component eigenfeatures. The utility of this layered representation (eigenface plus eigenfeatures) was tested on a small subset of a large face database: a representative sample of 45 individuals with two views per person, corresponding to different facial expressions (neutral vs. smiling). This set of images was partitioned into a training set (neutral) and a testing set (smiling). Because the difference between these particular facial expressions is primarily articulated in the mouth, this feature was discarded for recognition purposes.
Figure 2.15b shows the recognition rates as a function of the number of eigenvectors for eigenfaceonly, eigenfeature only, and the combined representation. What is surprising is that (for this small dataset at least) the eigenfeatures alone were sufficient to achieve an (asymptotic) recognition rate of 95% (equal to that of the eigenfaces).
More surprising, perhaps, is the observation that in the lower dimensions of eigenspace eigenfeatures outperformed the eigenface recognition. Finally, by using the combined representation, one gains a slight improvement in the asymptotic recognition rate (98%). A similar effect was reported by Brunelli and Poggio [4], where the cumulative normalized correlation scores of templates for the face, eyes, nose, and mouth showed improved performance over the faceonly templates.
A potential advantage of the eigenfeature layer is the ability to overcome the shortcomings of the standard eigenface method. A pure eigenface recognition system can be fooled by gross variations in the input image (e.g., hats, beards). However, the featurebased representation may still find the correct match by focusing on the characteristic nonoccluded features (e.g., the eyes and nose).
Recognition with Sets
An interesting recognition paradigm involves the scenario in which the input consists not of a single image but of a set of images of an unknown person. The set may consist of a contiguous sequence of frames from a video or a noncontiguous, perhaps unordered, set of photographs extracted from a video or obtained from individual snapshots. The former case is discussed in Chap. 13 (recognition from video). In the latter case, which we consider here, no temporal information is available. A possible approach, and in fact the one often taken until recently, has been to apply standard recognition methods to every image in the input set and then combine the results, typically by means of voting.
However, a large set of images contains more information than every individual image in it: It provides clues not only on the possible appearance on one’s face but also on the typical patterns of variation. Technically, just as a set of images known to contain an individual’s face allows one to represent that individual by an estimated intrinsic subspace, so the unlabeled input set leads to a subspace estimate that represents the unknown subject. The recognition task can then be formulated in terms of matching the subspaces.
One of the first approaches to this task has been the mutual subspace method (MSM) [41], which extracts the principal linear subspace of fixed dimension (via PCA) and measures the distance between subspaces by means of principal angles (the minimal angle between any two vectors in the subspaces). MSM has the desirable feature that it builds a compact model of the distribution of observations. However, it ignores important statistical characteristics of the data, as the eigenvalues corresponding to the principal components, as well as the means of the samples, are disregarded in the comparison. Thus its decisions may be statistically suboptimal.
A probabilistic approach to measuring subspace similarity has been proposed [33]. The underlying statistical model assumes that images of the jth person’s face have probability density pj ; the density of the unknown subject’s face is denoted by p0. The task of the recognition system is then to find the class label j *, satisfying
Therefore, given a set of images distributed by p0, solving (2.26) amounts to choosing optimally between M hypotheses of the form in statistics is sometimes referred to as the twosample hypothesis: that two sets of examples come from the same distribution. A principled way to solve this task is to choose the hypothesis j for which the KullbackLeibler divergence between p0 and pj is minimized.
In reality, the distributions pj are unknown and must be estimated from data, as well as p0. Shakhnarovich et al. [33] modeled these distributions as Gaussians (one per subject), which are estimated according to the method described in Sect. 2.3.2. The KL divergence is then computed in closed form. In the experiments reported by these authors [33], this method significantly outperformed the MSM.
Modeling the distributions by a single Gaussian is somewhat limiting; Wolf and Shashua [40] extended this approach and proposed a nonparametric discriminative method: kernel principal angles. They devised a positive definite kernel that operates on pairs of data matrices by projecting the data (columns) into a feature space of arbitrary dimension, in which principal angles can be calculated by computing inner products between the examples (i.e., application of the kernel). Note that this approach corresponds to nonlinear subspace analysis in the original space; for instance, one can use polynomial kernels of arbitrary degree. In experiments that included a face recognition task on a set of nine subjects, this method significantly outperformed both MSM and the Gaussianbased KLdivergence model of Shakhnarovich et al. [33].
Conclusions
Subspace methods have been shown to be highly successful in face recognition, as they have in many other vision tasks. The exposition in this topic roughly follows the chronologic order in which these methods have evolved. Two most notable directions in this evolution can be discerned: (1) the transition from linear to general, possibly nonlinear, and disconnected manifolds; and (2) the introduction of probabilistic and specifically Bayesian methods for dealing with the uncertainty and with similarity. All of these methods share the same core assumption: that ostensibly complex visual phenomena such as images of human faces, represented in a highdimensional measurement space, are often intrinsically lowdimensional. Exploiting this low dimensionality allows a face recognition system to simplify computations and to focus the attention on the features of the data relevant for the identity of a person.