Face Alignment Models (Face Image Modeling and Representation) (Face Recognition) Part 3

Iterative Model Refinement

Local searches on their own, however, are prone to spurious matches due to noisy data and unmodelled image properties. To ensure that the estimated shape agrees with the statistical model learned from training data (see Sect. 5.1.1), we regularise our solution by fitting the shape model to the local matches. Hopefully, this regularised estimate is closer to the true solution such that repeating the search-regularise cycle gives progressively better estimates (Algorithm 5.6). Since each point also has a quality of match score, given by (5.24), these scores can be used to weight points differently during model fitting, as in (5.8), according to our belief in their reliability [30].
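The regularisation step above can be sketched as a weighted least-squares projection onto the shape subspace. This is a minimal sketch assuming a linear shape model x ≈ x̄ + Pb with modes in the columns of P and one reliability weight per coordinate (all names are illustrative, not the chapter's own code):

```python
import numpy as np

def regularise_shape(x, x_mean, P, weights):
    """Fit the shape model to local matches x by weighted least squares.

    Solves b = (P^T W P)^{-1} P^T W (x - x_mean), cf. the weighted fit
    in (5.8), then reconstructs the regularised shape x_mean + P b.
    """
    W = np.diag(weights)
    b = np.linalg.solve(P.T @ W @ P, P.T @ W @ (x - x_mean))
    return x_mean + P @ b
```

Points with a poor quality-of-match score simply receive a small weight, so they pull the fitted shape less than reliable points do.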

Multi-Resolution Active Shape Models

To avoid local minima when searching the image, it is useful to smooth the error function in early stages and reduce the level of smoothing gradually with each iteration.



Fig. 5.9 Successful search for a face using the Active Shape Model



Fig. 5.10 Failure of the Active Shape Model to localise a face where the search profiles are not long enough to locate the edges of the face

In practice, we apply this smoothing by implementing the ASM in a multiresolution framework using a Gaussian image pyramid. This involves first searching for the object in a coarse image, then refining the shape in a series of progressively finer resolution images. Not only is this more robust to local minima but also more efficient, since less complex models can be used at the coarse levels of the pyramid.
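A minimal sketch of the pyramid construction described above, using a 5-tap binomial filter as the Gaussian approximation (function names are illustrative):

```python
import numpy as np

def _smooth(img):
    # Separable 5-tap binomial filter, a standard Gaussian approximation.
    k = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
    img = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, img)

def gaussian_pyramid(image, levels):
    """Level 0 is the original image; each further level halves the
    resolution after smoothing, as used for coarse-to-fine ASM search."""
    pyramid = [np.asarray(image, dtype=float)]
    for _ in range(levels - 1):
        pyramid.append(_smooth(pyramid[-1])[::2, ::2])
    return pyramid
```

Search then starts on the coarsest level, and the converged shape is scaled up by a factor of two each time it is passed to the next finer level.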

Examples of ASM Search

In one example of an ASM search to locate the features of a face (Fig. 5.9), we place the model instance near the centre of the image and perform a coarse to fine search, starting on the 3rd level of a Gaussian pyramid (1/8 the resolution in x and y compared to the original image). In the first few iterations, large improvements are made that get the position and scale roughly correct. As the search progresses, however, more subtle adjustments to the shape are made using the finer resolution images. After 18 iterations (with at most 10 iterations per pyramid level), the process has converged and gives a good match to the target image.

In another example, the ASM fails to localise the face (Fig. 5.10). This is most likely due to the initialisation being too far from the true solution such that the correct feature positions are beyond the scope of the local search and the process falls into a local minimum.

Further Reading

The Active Shape Model can be viewed as a specific example of a ‘constrained local model’ (CLM)—a class of algorithms that perform a local search for each feature (based on an independent set of learned texture models) then fit a learned shape model to the set of local matches. Addressing susceptibility to local minima, however, has been a driving force for various modifications to the match metric and search algorithm.

Although profile gradients have proven to be effective for local search, discriminative models of profile intensity can distinguish between correct and incorrect matches and improve performance further [59]. Better still, using 2D patches instead of 1D profiles makes the model even more discriminative [44], based on measures such as normalised correlation [21], boosted classification [20] or mixtures of linear experts [50] to define a match score. Where to look for potential matches is usually defined by hand (e.g., a rectangular or elliptical grid) but may also be learned from training data [38].

Once a response surface (that is, the set of match scores for all candidate locations) has been computed for each point, the ASM naïvely picks the best match for each point before projecting the set of matches back onto the subspace of permitted shapes. Effectively, this approximates each response surface by a Gaussian likelihood function with diagonal covariance; by including off-diagonal terms, we can model directional uncertainty and further improve performance [41, 45]. If the response surface is not approximated by a parametric function, or is approximated by a complex function such as a mixture of Gaussians [28] or a nonparametric kernel density estimate [51], the match function may be optimised using iterative methods such as the gradient-free Nelder-Mead simplex method [21] or mean-shift [51].

When using a PCA model of shape, each feature imposes constraints on every other feature such that computational limitations force us to select the match for each point independently of all other points. By assuming conditional independence between features, however, we can reduce the complexity of the graph and use Markov Random Field methods at little or no cost in efficiency [37]. Simplifying the graph in this way allows us to consider multiple candidates for each feature point and therefore increase robustness by avoiding local minima due to spurious matches that do not agree with the possible matches for other feature points. When choosing which dependencies to eliminate, trees [25] and k-fans [17] are popular due to their simplicity though more effective graph structures may be learned from training data [29].

Active Appearance Models (AAMs)

One criticism of the approaches related to the ASM is that they use only sparse local information around the points of interest. In addition, they often treat the information at each point as independent which is rarely the case. These criticisms are largely addressed by the following approach—dubbed the Active Appearance Model (AAM) [14]—that uses a combined model of appearance (Sect. 5.1.3) for image interpretation via the interpretation through synthesis paradigm: if we can find appearance model parameters which synthesise a face very similar to that in the target image, those parameters summarise the shape and texture of the face and can therefore be used directly for interpretation. In contrast to the Active Shape Model, the Active Appearance Model directly predicts incremental updates to appearance parameters from image residuals rather than performing a local search, making the method very efficient.

Algorithm 5.7: Image residual computation


Goodness of Fit

Given combined appearance model parameters, c, a set of pose parameters, t, and a set of texture normalisation parameters, u, we can concatenate the parameters into a single vector, p = (c^T | t^T | u^T)^T, synthesise a new face image and compute the residual, r(p) = gs − gm, with respect to the observed data, where gs is the normalised texture sampled from the image and gm is the texture synthesised by the model. We then assess the quality of the synthesis (Algorithm 5.7) by some function of r(p), such as the sum of squared error,

E(p) = r(p)^T r(p),

as used in our examples. Like the ASM, we can also make assumptions about the distribution of residuals to estimate p(r | p) and place the matching in a Bayesian framework [9].
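Assuming a simple linear texture model gm = ḡ + Pg bg, the residual and error computations can be sketched as follows (a sketch only; the full algorithm also warps the sample into the model frame and photometrically normalises it):

```python
import numpy as np

def compute_residual(g_sample, g_mean, Pg, b_g):
    """r(p): normalised image sample minus the model synthesis,
    assuming the (hypothetical) linear texture model g_mean + Pg b_g."""
    g_model = g_mean + Pg @ b_g
    return g_sample - g_model

def sum_squared_error(r):
    """E(p) = r^T r, the goodness-of-fit used in the examples."""
    return float(r @ r)
```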

Updating Model Parameters

Given one estimate of the parameters, p = p* + δp (where δp is our displacement from the true solution, p*), and the corresponding residual, r(p), we then want to modify the parameters by δp to minimise |r(p − δp)|^2. Though we could do this via gradient descent [33], the AAM instead assumes that δp can be predicted linearly from the residual vector such that δp = Rr(p). In this section, we present two approaches to learning the matrix R from training data that consists of random parameter displacements, δp_i (stored in the columns of a matrix, C), and the corresponding residuals, r_i (stored in the columns of a matrix, V).

Since we want the model to be independent of the background in the training images, perturbed texture samples that include pixels from the background must be accounted for when building the model. One approach is to remove background pixels from the update model though we use the simpler alternative of setting background pixels to some random value.
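Collecting the training displacements and residuals might be sketched as below, assuming a caller-supplied residual_fn that resamples the (background-randomised) training image at the displaced parameters; all names are illustrative:

```python
import numpy as np

def build_training_matrices(residual_fn, p_star, n_samples, scale, rng):
    """Columns of C hold random displacements dp; columns of V hold the
    residuals r(p* + dp) measured at the displaced parameters."""
    C = rng.normal(0.0, scale, size=(len(p_star), n_samples))
    V = np.column_stack([residual_fn(p_star + C[:, i]) for i in range(n_samples)])
    return C, V
```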

Estimating R via Linear Regression

Given parameter displacements, C, and the corresponding image residuals, V, a linear update relationship gives

R = CV†,

where V† is the pseudo-inverse of V [12].

Unless, however, there are more displacements than pixels modelled (a rare occurrence) the model will overfit to the training data. To address this problem, applying PCA to reduce the dimensionality of the residuals (and effectively increase sampling density) before performing the regression has been shown to reduce overfitting and improve performance [31]. Alternatively, rather than projecting onto a lower dimensional subspace that maximises the variance of the projected inputs (that is, image residuals), Canonical Correlation Analysis (CCA) improves performance further [22] by computing subspaces for both inputs and outputs (that is, parameter displacements) that maximise the correlation between their respective projections.
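A sketch of the regression estimate, with an optional PCA projection of the residuals as described above (the function name and interface are hypothetical):

```python
import numpy as np

def estimate_R_regression(C, V, n_components=None):
    """R = C V+ via the pseudo-inverse; optionally project residuals onto
    their top principal components first to reduce overfitting."""
    if n_components is None:
        return C @ np.linalg.pinv(V)
    U, _, _ = np.linalg.svd(V, full_matrices=False)
    B = U[:, :n_components]                 # residual subspace basis
    R_reduced = C @ np.linalg.pinv(B.T @ V) # regress on projected residuals
    return R_reduced @ B.T                  # maps full residuals to dp
```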

Estimating R via Gauss-Newton Approximation

An alternative way to avoid overfitting is suggested by the first order Taylor expansion,

r(p − δp) ≈ r(p) − Jδp,

where the ij th element of the matrix J = ∂r/∂p is ∂r_i/∂p_j, such that |r(p) − Jδp|^2 is minimised with respect to δp by the least-squares solution,

δp = Rr(p)  where  R = (J^T J)^{-1} J^T.    (5.27)

In a standard optimisation scheme, we would recalculate ∂r/∂p at every step, a computationally expensive operation. Since it is being computed in a normalised reference frame, however, we assume that it is approximately fixed and can be precomputed from the training set [14]. In practice, we express (5.27) in terms of the training data, C and V, to give

R = (J^T J)^{-1} J^T  where  J = arg min_J ||V − JC||F = VC†,    (5.28)

where || · ||F denotes the Frobenius norm. This Gauss-Newton approximation is popular because computing the pseudoinverse Ct is usually quicker and more robust than computing Vt due to their relative sizes. We then precompute R via (5.28) and use it in all subsequent image searches. To ensure a reliable estimate, we measure residuals at displacements of differing magnitudes (typically up to 0.5 standard deviations of each parameter) and combine them by smoothing with a Gaussian kernel. Qualitatively, computing the update via a Gauss-Newton approximation should be more stable, has a clearer mathematical interpretation and allows extra constraints to be incorporated easily [9]. Quantitatively, however, tests comparing the different approaches [7] have shown that using linear regression gives better localisation performance.
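The Gauss-Newton estimate can be sketched in the same framework: fit the Jacobian to the training data by least squares, then form the update matrix (a sketch with illustrative names, not the authors' implementation):

```python
import numpy as np

def estimate_R_gauss_newton(C, V):
    """Fit the Jacobian J = V C+ by least squares over the training
    displacements, then form R = (J^T J)^{-1} J^T as in (5.28)."""
    J = V @ np.linalg.pinv(C)
    return np.linalg.solve(J.T @ J, J.T)
```

Because C has one row per model parameter while V has one row per modelled pixel, C is the much smaller matrix, which is why its pseudo-inverse is cheaper and more stable to compute.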

Iterative Model Refinement

Given an initial estimation of the model parameters, c, the pose, t, and the texture transformation, u, we repeatedly apply (5.28) to update model parameters based on the measured residual, r, giving estimates that get progressively closer to the true solution (Algorithm 5.8).

When we update the parameter vector, p, the simplest approach is to subtract the predicted displacement, δp = Rr(p), such that p' = p − δp.

The update step, however, estimates corrections in the model frame which must then be projected into the image frame using the current pose and texture transformations. Strictly speaking, therefore, we should update the parameters controlling the pose, t, and texture transformation, u, by composing the resulting transformations (during both training and image search). In other words, writing Tt for the pose transformation and Tu for the texture transformation, we should compute new pose parameters, t', such that Tt' = Tt ∘ Tδt^(-1), and new texture transformation parameters, u', such that Tu' = Tu ∘ Tδu^(-1), where updates are applied in the model frame before transforming to the image frame.
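Putting the pieces together, a simplified, purely additive version of the fitting loop might look like the following sketch (as noted above, pose and texture parameters should strictly be updated by composition; names are illustrative):

```python
import numpy as np

def aam_fit(p0, residual_fn, R, n_iter=30, tol=1e-10):
    """Iteratively apply p <- p - R r(p) until the error is small,
    a bare-bones additive sketch of Algorithm 5.8."""
    p = np.asarray(p0, dtype=float)
    for _ in range(n_iter):
        r = residual_fn(p)
        if float(r @ r) < tol:  # converged: residual energy is negligible
            break
        p = p - R @ r
    return p
```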

Multi-Resolution Active Appearance Models

As in the Active Shape Model, we estimate the appearance models and update matrices at a range of image resolutions using a Gaussian image pyramid. We can then use a multi-resolution search algorithm in which we start at a coarse resolution and iterate to convergence at each level before projecting the current solution to the next level of the model [33]. This is more efficient and can converge to the correct solution from further away than search at a single resolution. Computationally, the complexity of the AAM at a given level is O(nmodes · npixels) since each iteration samples npixels points from the image then multiplies by an nmodes × npixels matrix.

Algorithm 5.8: Active Appearance Model (AAM) fitting


Examples of AAM Search

When using an AAM to localise a face in a previously unseen image, the algorithm typically requires fewer than 20 iterations to converge to a faithful reproduction of the face (Fig. 5.11). Like the ASM, however, the AAM is prone to local minima if started too far from the true solution (Fig. 5.12).


Fig. 5.11 Search using the Active Appearance Model on faces not in the training set, showing evolution of the shape and the final image reconstruction. Initial iterations are performed using a low resolution model and resolution increases as the search progresses


Fig. 5.12 Example of AAM search failure where the initialisation was too far from true position. The model has matched the eye and eyebrow to the wrong side of the face, and attempted to explain the dark background by shading one side of the reconstructed face

Alternative Strategies

Following the Active Appearance Model, a variety of related approaches to matching models of shape and texture have been suggested. Here, we summarise some of the key contributions.

Shape AAM

Though combining shape and appearance parameters has its uses in capturing correlations, treating the parameters separately can have computational benefits. Consider the case where we use the residuals to update only the pose, t, and shape model parameters, bs, such that

(δt^T | δbs^T)^T = Rr,

where the model texture, gm, is now simply the projection of the normalised sample, gs, onto the texture subspace (since shape and texture are treated independently). In this case,

r = gs − gm = (I − Pg Pg^T)(gs − ḡ),

where (I − Pg Pg^T) can be precomputed such that the texture model is required only to compute the texture error for the purposes of detecting convergence. Terminating instead after a fixed number of iterations, or when the changes in the shape parameters become small, dispenses with the texture model altogether and results in a much faster (though less accurate) algorithm. If required, a combined model of shape and texture can be used to apply post hoc constraints to the relative shape and texture parameter vectors by projecting them into the combined appearance space. This approach, known as the ‘Shape AAM’ [13], is closely related to the ‘Active Blob’ method [52] that uses an elastic deformation model rather than a statistical model of shape.
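The precomputation at the heart of the Shape AAM can be sketched as follows, assuming an orthonormal texture basis Pg (all names illustrative):

```python
import numpy as np

def make_shape_aam_residual(g_mean, Pg):
    """Precompute the projector onto the complement of the texture
    subspace; the residual is then the part of the normalised sample
    the texture model cannot explain: r = (I - Pg Pg^T)(g_s - g_mean)."""
    proj = np.eye(len(g_mean)) - Pg @ Pg.T  # computed once, reused every iteration
    return lambda g_s: proj @ (g_s - g_mean)
```

Because the projector is fixed, each search iteration needs only a matrix-vector product, which is what makes this variant so fast.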

Compositional Approach

As noted earlier (Sect. 5.3.3), pose and texture transformation parameters should be updated via composition (rather than addition) and it can be shown that there are benefits from updating shape parameters in the same way [42]. If we consider (5.4) as a parameterised transformation of the mean shape, x = Tb,t(x̄), then we need to find parameters, b' and t', such that Tb',t'(x̄) = Tb,t(Tδb,δt(x̄)), for example by approximating the composed transformation with a thin-plate spline (Algorithm 5.9). Using the inverse compositional image alignment algorithm [1] improves efficiency further by specifying Jacobians and Hessians as functions of template images (rather than sampled images) such that they can be precomputed, thus saving computation at run-time. Also decoupling shape from texture for efficiency, the resulting inverse compositional AAM [42] has demonstrated model fitting at speeds of up to 200 frames per second.

Algorithm 5.9: Compositional AAM fitting with a Thin Plate Spline [6]

Further Reading

Since their introduction, Active Appearance Models have spawned many variants [26] and also demonstrated considerable success in medical image analysis (for which software is publicly available [55]). In addition to the two variants already described (Sect. 5.3.6), other modifications include methods for expressing the update matrix, R, as a function of the current residual for improved convergence [2] and sequential implementations that tune the training data to match the expected error distribution [49].

Predicting parameter updates via nonlinear regression has also been proposed, where boosting a number of weak regressors is currently popular [48, 63]. Using gradient descent-based algorithms to minimise an error metric learned from training data has also shown promise [39], as has selecting updates via a pairwise comparison of two potential candidates [62].

Conclusions

In this topic, we have described powerful statistical models of the shape and texture of faces that are capable of synthesising a wide range of convincing face images. Algorithms such as the Active Shape Model (ASM) and Active Appearance Model (AAM) rapidly fit these appearance models to unseen image data such that the parameters capture the underlying properties of the face, isolating those sources of variation that are essential to face recognition (that is, identity) from those that are not (e.g., expression).

One weakness of both the ASM and AAM (and their variations) is that they are local optimisation techniques and tend to fall into local minima if initialisation is poor. Where independent estimates of feature point positions are available (e.g., from an eye tracker) these can be incorporated into the matching schemes and lead to more reliable matching [9].

These approaches also rely on an annotated corpus of training data and therefore can only deal effectively with certain types of variation in appearance. For example, person-specific variation that cannot be corresponded (e.g., wrinkles on the forehead or the appearance of moles) tends to get blurred out by the averaging process inherent in the modelling. This suggests that these methods may be improved by adding further layers of information to the model in order to represent individual differences which are poorly represented as a result of pooling in the current models.

Open questions (some of which are currently under investigation) include:

•    How do we obtain accurate correspondences across the training set?

•    What is the optimal choice of model size and number of model modes?

•    How should image structure be represented?

•    What is the best method of matching the model to the image?

•    How do we avoid local minima in the error surface?
