Face Detection (Face Recognition Techniques) Part 2

Learning Weak Classifiers

As mentioned earlier, the AdaBoost learning procedure is aimed at learning a sequence of weak classifiers h_m(x) and the combining weights α_m in (11.1). It solves the following three fundamental problems: (1) learning effective features from a large feature set; (2) constructing weak classifiers, each of which is based on one of the selected features; and (3) boosting the weak classifiers into a strong classifier.

AdaBoost assumes that a “weak learner” procedure is available. The task of the procedure is to select the most significant feature from a set of candidate features, given the current strong classifier learned thus far, and then construct the best weak classifier and combine it into the existing strong classifier. Here, the “significance” is with respect to some given criterion (see below).

In the case of discrete AdaBoost, the simplest type of weak classifiers is a “stump.” A stump is a single-node decision tree. When the feature is real-valued, a stump may be constructed by thresholding the value of the selected feature at a certain threshold value; when the feature is discrete-valued, it may be obtained according to the discrete label of the feature. A more general decision tree (with more than one node) composed of several stumps leads to a more sophisticated weak classifier.

For discrete AdaBoost, a stump may be constructed in the following way. Assume that we have constructed M − 1 weak classifiers h_1(x), …, h_{M−1}(x) and we want to construct h_M(x). The stump h_M(x) is determined by comparing the selected feature z_{k*}(x) with a threshold τ_{k*} as follows


$$h_M(x) = \begin{cases} +1, & z_{k^*}(x) \ge \tau_{k^*} \\ -1, & \text{otherwise} \end{cases} \qquad (11.5)$$

In this form, h_M(x) is determined by two parameters: the type of the scalar feature z_{k*} and the threshold τ_{k*}. The two may be determined by some criterion, for example, (1) the minimum weighted classification error, or (2) the lowest false alarm rate given a certain detection rate.

Suppose we want to minimize the weighted classification error with real-valued features. We can then choose a threshold τ_k for each feature z_k to minimize the corresponding weighted error made by the stump with this feature, and choose the best feature z_{k*} among all k as the one that achieves the lowest weighted error.


Fig. 11.6 Finding the optimal threshold of a weak classifier

Wu et al. show in [47] that only N + 1 possible threshold values need to be evaluated, and evaluating each candidate τ_k takes only O(1) time if the feature values of z_k are sorted beforehand. This method for finding the optimal threshold of a weak classifier is illustrated in Fig. 11.6.

Suppose the N training examples are sorted in ascending order of their feature values, z_k(x_1) ≤ z_k(x_2) ≤ … ≤ z_k(x_N). If two candidate thresholds τ and τ′ satisfy z_k(x_i) ≤ τ < τ′ ≤ z_k(x_{i+1}), then setting the threshold to either τ or τ′ will result in the same weighted error. Thus, in Fig. 11.6 only the sorted feature values (plus one value larger than all of them) are considered as possible thresholds. Given the weighted error at τ = z_k(x_i), a simple update is sufficient to compute the error when τ = z_k(x_{i+1}), because at most one example changes its classification result.

Most of the computation in Fig. 11.6 is spent in the initialization part. One important observation in [47] is that the feature values need to be computed and sorted only once, because they do not change during the AdaBoost process even though the weights w change at each iteration. By storing the sorted feature values of all features in a table, the AdaBoost training time is reduced from weeks [45] to hours [47].
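As an illustration of this scheme, here is a minimal NumPy sketch of the single-pass threshold search (my own reconstruction of the idea, not the code from [47]); it assumes labels in {−1, +1}, the stump form of (11.5), and omits the parity flip that a complete implementation would also consider.

```python
import numpy as np

def best_stump_threshold(feature_values, labels, weights):
    """Find the threshold of a stump h(x) = +1 if z(x) >= tau else -1 that
    minimizes the weighted error, by scanning the sorted feature values once
    (the scheme sketched in Fig. 11.6)."""
    order = np.argsort(feature_values)          # sort once; reusable across rounds
    z, y, w = feature_values[order], labels[order], weights[order]

    # Start with tau below the smallest value: every example is predicted +1,
    # so the error is the total weight of the negative examples.
    error = np.sum(w[y == -1])
    best_error, best_tau = error, z[0] - 1.0

    # Moving tau just above z[i] flips the prediction of example i from +1 to -1:
    # a positive example becomes an error, a negative one stops being an error.
    for i in range(len(z)):
        error += w[i] if y[i] == +1 else -w[i]
        # Only positions where the feature value strictly increases yield a
        # realizable threshold (duplicates would split identical values).
        if i + 1 == len(z) or z[i + 1] > z[i]:
            tau = z[i + 1] if i + 1 < len(z) else z[i] + 1.0
            if error < best_error:
                best_error, best_tau = error, tau
    return best_tau, best_error
```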

More features usually lead to higher detection accuracy [27]. However, more features also mean that the table of sorted feature values may become too large to fit in main memory. Pham and Cham [27] instead construct weak classifiers using the mean and standard deviation of the feature values. Given a feature, its associated mask m, and the AdaBoost weights w_i, the weighted average feature value is Σ_i w_i m^T x_i, where {x_i} is the set of training examples.

The integral image trick can be used to accelerate the computation of the mean feature value and the standard deviation; that is, it provides a way to exploit the structure in the mask m and compute m^T x quickly. Let x be an image subwindow in stacked vector form and y be the corresponding integral image. It is clear from (11.2) that the transformation that generates y from x is linear; that is, there exists a square matrix B such that y = Bx. The average feature value is then [27]

$$\sum_i w_i\, m^T x_i = \sum_i w_i\, m^T B^{-1} y_i = \left(B^{-T} m\right)^T \left(\sum_i w_i\, y_i\right) \qquad (11.6)$$

In (11.6), the weighted average integral image Σ_i w_i y_i can be computed in linear time, and the transformed mask B^{-T} m is sparse because of the structure in the mask m. Thus, the average feature value can be computed very quickly. Similarly, the (weighted) standard deviation can also be computed quickly.
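To make the sparsity concrete, the following sketch (an illustration under the usual integral-image definition, not code from [27]) shows why a rectangular mask reduces to a few nonzero entries after the transformation: a box sum over raw pixels becomes four lookups in the integral image y.

```python
import numpy as np

def integral_image(window):
    """Cumulative row/column sums: entry (r, c) holds the sum of all pixels
    above and to the left of (r, c), inclusive."""
    return window.cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, top, left, bottom, right):
    """Sum of pixels in the rectangle [top..bottom] x [left..right], computed
    from at most four integral-image lookups (the sparse transformed mask)."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total
```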

The faces are then modeled as a one-dimensional Gaussian distribution N(μ_+, σ_+^2), whose mean and standard deviation are computed from the face examples. Nonfaces are modeled by N(μ_−, σ_−^2) similarly. The best threshold that separates the two 1-D Gaussians can then be found in closed form.

This method is faster than examining all possible τ_k values: it is reported in [27] that training is about two times faster than with the algorithm in Fig. 11.6. It also has a much smaller storage requirement and thus can train a strong classifier with more local features.
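The closed-form threshold mentioned above is the point where the two class-conditional Gaussian densities intersect. A minimal sketch, assuming equal class priors, solves the resulting quadratic:

```python
import numpy as np

def gaussian_decision_threshold(mu_pos, sigma_pos, mu_neg, sigma_neg):
    """Threshold where two 1-D Gaussian densities intersect (equal priors assumed).
    Setting N(t; mu_pos, sigma_pos^2) = N(t; mu_neg, sigma_neg^2) gives a quadratic in t."""
    if np.isclose(sigma_pos, sigma_neg):
        return (mu_pos + mu_neg) / 2.0          # equal variances: midpoint of the means
    a = 1.0 / sigma_neg**2 - 1.0 / sigma_pos**2
    b = 2.0 * (mu_pos / sigma_pos**2 - mu_neg / sigma_neg**2)
    c = (mu_neg**2 / sigma_neg**2 - mu_pos**2 / sigma_pos**2
         + 2.0 * np.log(sigma_neg / sigma_pos))
    roots = np.roots([a, b, c])
    # Prefer the intersection point that lies between the two class means.
    lo, hi = sorted((mu_pos, mu_neg))
    between = [r.real for r in roots if np.isreal(r) and lo <= r.real <= hi]
    return between[0] if between else roots[0].real
```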

A decision stump is simple but may not fully utilize the information contained in a feature. A more complex weak classifier can be constructed by using piecewise decision functions [14, 22]: divide the range of feature values into k nonoverlapping cells and learn a decision function (for example, a decision stump) for every cell. Piecewise decision functions take longer to train but usually have higher discrimination power than simple decision stumps.
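A minimal sketch of such a piecewise weak classifier, assuming equal-width cells and a weighted majority vote inside each cell (both are simplifications of mine, not necessarily the choices made in [14, 22]):

```python
import numpy as np

def train_piecewise_classifier(feature_values, labels, weights, n_cells=8):
    """Divide the feature range into n_cells equal-width cells and predict, in
    each cell, the class with the larger total weight (weighted majority vote)."""
    edges = np.linspace(feature_values.min(), feature_values.max(), n_cells + 1)
    cells = np.clip(np.digitize(feature_values, edges[1:-1]), 0, n_cells - 1)
    predictions = np.empty(n_cells)
    for c in range(n_cells):
        in_cell = cells == c
        w_pos = weights[in_cell & (labels == +1)].sum()
        w_neg = weights[in_cell & (labels == -1)].sum()
        predictions[c] = +1.0 if w_pos >= w_neg else -1.0
    return edges, predictions

def apply_piecewise_classifier(feature_values, edges, predictions):
    cells = np.clip(np.digitize(feature_values, edges[1:-1]), 0, len(predictions) - 1)
    return predictions[cells]
```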

Suppose instead that we want to achieve the lowest false alarm rate given a certain detection rate. We can then set a threshold τ_k for each z_k so that a specified detection rate (with respect to the weights w) is achieved by the stump corresponding to the pair (z_k, τ_k).

Given this, the false alarm rate (also with respect to w) due to this new h_M can be calculated. The best pair (z_{k*}, τ_{k*}), and hence h_M, is the one that minimizes the false alarm rate.

There is still another parameter that can be tuned to balance the detection rate and the false alarm rate. The class label prediction ŷ = sign(H_M(x)) is obtained by thresholding the strong classifier H_M(x) at the default threshold value 0. However, it can also be computed as ŷ = sign(H_M(x) − T_M) with another threshold value T_M, which can be tuned to strike the balance.
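One simple way to tune such a threshold (a sketch of the general idea rather than a procedure prescribed here) is to pick T_M as the score quantile that keeps a target fraction of the face training examples above it:

```python
import numpy as np

def tune_strong_classifier_threshold(face_scores, target_detection_rate=0.999):
    """Choose T_M so that roughly target_detection_rate of the face examples
    satisfy H_M(x) - T_M >= 0, i.e., keep the (1 - rate) quantile as the threshold."""
    return np.quantile(face_scores, 1.0 - target_detection_rate)
```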

The form of (11.5) is for Discrete AdaBoost. In the case of real-valued versions of AdaBoost, such as RealBoost and LogitBoost, a weak classifier should be real-valued or output the class label together with a probability value. For the real-valued type, a weak classifier may be constructed as the log-likelihood ratio computed from the histograms of the feature value for the two classes (see the literature for more details [17–19]). For the latter, it may be a decision stump or tree with probability values attached to the leaves [21].

Learning Strong Classifiers Using AdaBoost

AdaBoost learns a sequence of weak classifiers h_m and boosts them into a strong classifier H_M by effectively minimizing an upper bound on the classification error of H_M. The bound can be derived as the following exponential loss function [32]

$$J(H_M) = \sum_i e^{-y_i H_M(x_i)} \qquad (11.7)$$

where i is the index of the training examples. AdaBoost constructs h_m(x) (m = 1, …, M) by stagewise minimization of (11.7). Given the current strong classifier H_{M−1}(x) = Σ_{m=1}^{M−1} α_m h_m(x) and the newly learned weak classifier h_M, the best combining coefficient α_M for the new strong classifier H_M(x) = H_{M−1}(x) + α_M h_M(x) minimizes the cost

$$\alpha_M = \arg\min_{\alpha}\, J\bigl(H_{M-1}(x) + \alpha\, h_M(x)\bigr) \qquad (11.8)$$

The minimizer is

$$\alpha_M = \frac{1}{2}\log\frac{1-\varepsilon_M}{\varepsilon_M} \qquad (11.9)$$

where ε_M is the weighted error rate

$$\varepsilon_M = \sum_i w_i^{(M-1)}\, \mathbf{1}\bigl[y_i \ne h_M(x_i)\bigr] \qquad (11.10)$$

where 1[C] is 1 if C is true and 0 otherwise.

Each example is reweighted after an iteration; that is, w_i is updated (and renormalized to sum to one) according to the classification performance of H_M:

$$w_i^{(M)} = w_i^{(M-1)}\, e^{-y_i \alpha_M h_M(x_i)} \qquad (11.11)$$

which is used for calculating the weighted error or another cost for training the weak classifier in the next round. This way, a more difficult example is associated with a larger weight so it is emphasized more in the next round of learning. The algorithm is summarized in Fig. 11.7.
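Under the reconstruction of (11.9)–(11.11) above, a minimal NumPy sketch of the training loop in Fig. 11.7 might look as follows; `train_weak_learner` is a hypothetical callback standing in for the feature selection and stump training discussed earlier, and labels are assumed to be in {−1, +1}.

```python
import numpy as np

def adaboost_train(features, labels, n_rounds, train_weak_learner):
    """Discrete AdaBoost loop (cf. Fig. 11.7).

    features: (n_examples, n_features) matrix of precomputed feature values.
    labels:   array of +1 / -1 class labels.
    train_weak_learner(features, labels, weights) must return a function
    h(features) -> array of +1 / -1 predictions (e.g., the best stump).
    """
    n = len(labels)
    weights = np.full(n, 1.0 / n)                   # uniform initial weights
    weak_classifiers, alphas = [], []

    for _ in range(n_rounds):
        h = train_weak_learner(features, labels, weights)
        pred = h(features)
        eps = np.sum(weights[pred != labels])       # weighted error, (11.10)
        eps = np.clip(eps, 1e-12, 1 - 1e-12)        # guard against division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)       # combining coefficient, (11.9)
        weights *= np.exp(-labels * alpha * pred)   # reweighting, (11.11)
        weights /= weights.sum()                    # renormalize
        weak_classifiers.append(h)
        alphas.append(alpha)

    def strong_classifier(x):
        scores = sum(a * h(x) for a, h in zip(alphas, weak_classifiers))
        return np.sign(scores)
    return strong_classifier
```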

Alternative Feature Selection Methods

In boosting-based methods (cf. Fig. 11.7), the weak classifiers h_M and their related weights α_M are determined simultaneously: h_M is chosen to minimize a certain objective value (for example, the weighted error rate) and α_M is a function of that objective value. Wu et al. [47] show that if these two tasks (learning h_M and setting α_M) are decoupled into two sequential steps, a more accurate strong classifier H_M can be obtained.


Fig. 11.7 AdaBoost learning algorithm

Different methods can be used to select features and train weak classifiers. Besides AdaBoost and other boosting variants, [47] showed that a greedy Forward Feature Selection (FFS) method can successfully select a subset of features from a large feature pool and learn the corresponding weak classifiers. In boosting methods, a feature is selected if its corresponding weak classifier has the minimum weighted error rate. FFS uses a different selection criterion that is directly related to the strong classifier’s performance: if a partial strong classifier H_{M−1} is already constructed, a weak classifier h_{M*} is selected in iteration M only if it leads to the highest strong classifier accuracy, that is, H_{M−1} + h_{M*} has the highest accuracy among all possible h_M. FFS uses a majority vote (that is, α_i = 1 for all i). A table of feature values is stored to ensure fast weak classifier training. FFS trains faster than the AdaBoost method, and achieves comparable but slightly lower detection accuracy than AdaBoost.
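A minimal sketch of the FFS selection loop (my own illustration; the candidate weak classifiers are assumed to be pretrained functions returning ±1 predictions, and retraining of candidates between iterations is omitted):

```python
import numpy as np

def forward_feature_selection(candidate_classifiers, features, labels, n_select):
    """Greedy FFS: at each iteration add the candidate whose inclusion gives the
    highest accuracy of the majority-vote strong classifier (all alphas = 1)."""
    selected = []
    partial_sum = np.zeros(len(labels))          # running sum of selected predictions

    for _ in range(n_select):
        best_acc, best_idx, best_pred = -1.0, None, None
        for idx, h in enumerate(candidate_classifiers):
            if idx in selected:
                continue
            pred = h(features)
            acc = np.mean(np.sign(partial_sum + pred) == labels)
            if acc > best_acc:
                best_acc, best_idx, best_pred = acc, idx, pred
        if best_idx is None:                     # no candidates left
            break
        selected.append(best_idx)
        partial_sum += best_pred
    return selected
```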

In fact, it has been shown that AdaBoost is a sequential forward search procedure that uses a greedy selection strategy to minimize a certain margin-based cost on the training set [32]. Conceptually, FFS and AdaBoost share the greedy feature selection idea, although different objective functions are used to guide the greedy search.

A crucial heuristic assumption used in such a sequential forward search procedure is monotonicity, that is, that adding a new weak classifier to the current set does not decrease the value of the performance criterion. The premise of the sequential procedure in AdaBoost or FFS breaks down when this assumption is violated. Floating Search [29] is a sequential feature selection procedure with backtracking, aimed at dealing with nonmonotonic criterion functions. The sequential forward floating search (SFFS) method [29] adds or deletes a single (ℓ = 1) feature and then backtracks r steps, where r depends on the current situation.


Fig. 11.8 FloatBoost algorithm

The FloatBoost learning procedure is shown in Fig. 11.8. It is composed of several parts: the training input, initialization, forward inclusion, conditional exclusion, and output. In step 2 (forward inclusion), the currently most significant weak classifiers are added one at a time, just as in AdaBoost. In step 3 (conditional exclusion), FloatBoost removes the least significant weak classifier from the set H_M of current weak classifiers, subject to the condition that the removal leads to a lower cost than the one previously achieved with M − 1 weak classifiers. Supposing that the removed weak classifier was the m'th in H_M, the weak classifiers learned after it and their combining coefficients must then be relearned. These steps are repeated until no more removals can be done.
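A schematic sketch of the forward-inclusion / conditional-exclusion interplay, paraphrasing Fig. 11.8; `train_best_weak_classifier` and `cost` (for example, the exponential loss of (11.7)) are hypothetical helpers, and the relearning of the weak classifiers that followed a removed one is omitted for brevity.

```python
def floatboost_train(train_best_weak_classifier, cost, max_classifiers):
    """Schematic FloatBoost loop: forward inclusion as in AdaBoost, followed by
    conditional exclusion of weak classifiers that no longer pay for themselves."""
    ensemble = []                        # current set of weak classifiers
    best_cost = {0: float("inf")}        # best cost seen for each ensemble size

    while len(ensemble) < max_classifiers:
        # Forward inclusion: add the currently most significant weak classifier.
        ensemble.append(train_best_weak_classifier(ensemble))
        best_cost[len(ensemble)] = min(best_cost.get(len(ensemble), float("inf")),
                                       cost(ensemble))

        # Conditional exclusion: drop the least significant member while doing so
        # beats the best cost previously achieved with one fewer classifier.
        while len(ensemble) > 1:
            candidates = [ensemble[:i] + ensemble[i + 1:] for i in range(len(ensemble))]
            best_subset = min(candidates, key=cost)
            if cost(best_subset) < best_cost[len(ensemble) - 1]:
                ensemble = best_subset
                best_cost[len(ensemble)] = cost(ensemble)
            else:
                break
    return ensemble
```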

Asymmetric Learning Methods

The face detection (and other object detection) problem is a rare-event detection problem [47], in the sense that faces (or target objects) occupy only a small number of subwindows while the nonface (or nonobject) subwindows number in the millions even in a small image. This asymmetric nature of the classifier learning problem has long been recognized, and methods have been proposed to build a strong classifier that takes the asymmetry into account.

Viola and Jones proposed the AsymBoost method [44], which is a modification of the AdaBoost algorithm. The essence of AsymBoost is to focus more on positive examples by changing the weight update rule (11.11) to

$$w_i^{(M)} = w_i^{(M-1)}\, e^{-y_i \alpha_M h_M(x_i)}\, e^{\,y_i \log\sqrt{k}\,/\,T} \qquad (11.12)$$

where k is a parameter that measures the level of asymmetry and T is the number of boosting rounds. Over the T rounds of the AdaBoost algorithm, positive examples are thus consistently assigned higher weights than negative examples.
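The modification amounts to one extra multiplicative factor per round. A minimal sketch of the modified reweighting step, with the same {−1, +1} label convention as before:

```python
import numpy as np

def asymboost_reweight(weights, labels, alpha, predictions, k, n_rounds):
    """One AsymBoost reweighting step: the usual AdaBoost update of (11.11)
    times an asymmetry factor that favors positive examples, cf. (11.12)."""
    new_w = weights * np.exp(-labels * alpha * predictions)
    new_w *= np.exp(labels * np.log(np.sqrt(k)) / n_rounds)
    return new_w / new_w.sum()
```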

Wu et al. [47] propose another asymmetric learning method: they show that it is advantageous to adjust the values of α_i after the features are selected, taking the cascade detection framework into account. Assuming the false alarm rate of every strong classifier in a cascade is 0.5 and that the strong classifiers reject nonfaces independently of each other, a 20-node cascade will have a false alarm rate of about 10^{-6}. Thus, Wu et al. propose the following learning goal for the strong classifiers in a cascade: “for every node, design a classifier with very high (e.g., 99.9%) detection rate and only moderate (e.g., 50%) false positive rate.” The Linear Asymmetric Classifier (LAC) is designed to find the α that achieves this goal.

Given weak classifiers h_1, …, h_M, an example x is mapped to a vector of responses h(x) = (h_1(x), …, h_M(x))^T. LAC computes the distributions of this vector for the two classes: μ_+ and Σ_+ are the mean and covariance matrix of h(x) when x ranges over the set of faces; similarly, μ_− and Σ_− are the mean and covariance matrix computed using nonfaces. It is shown in [47] that the following LAC solution vector a* is globally optimal for the cascade learning goal under certain reasonable assumptions:

$$a^* = \Sigma_+^{-1}\,(\mu_+ - \mu_-) \qquad (11.13)$$

Another way to set the a vector is to use Fisher’s Discriminant Analysis (FDA). Experiments in [47] show that using LAC or FDA to set the α vector consistently improves cascade detection accuracy, regardless of whether the weak classifiers are selected and trained using AdaBoost or FFS.
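A minimal NumPy sketch of this post-hoc coefficient adjustment, based on the reconstruction above (the exact solution form and the accompanying threshold should be checked against [47]); the weak-classifier responses are assumed to be precomputed, one row per example:

```python
import numpy as np

def lac_coefficients(face_responses, nonface_responses, use_fda=False):
    """Set the combining coefficients a from the weak-classifier response vectors.

    face_responses, nonface_responses: arrays of shape (n_examples, n_weak),
    each row being (h_1(x), ..., h_M(x)) for one example.
    """
    mu_pos = face_responses.mean(axis=0)
    mu_neg = nonface_responses.mean(axis=0)
    cov_pos = np.cov(face_responses, rowvar=False)
    cov_neg = np.cov(nonface_responses, rowvar=False)

    # LAC uses the positive-class covariance only; FDA uses the pooled covariance.
    scatter = cov_pos + cov_neg if use_fda else cov_pos
    # Small ridge term for numerical stability when the covariance is near singular.
    scatter = scatter + 1e-6 * np.eye(scatter.shape[0])
    return np.linalg.solve(scatter, mu_pos - mu_neg)
```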
