Gabor-Like Image Filtering for Transient Feature Detection and Global Energy Estimation Applied to Multi-expression Classification (Computer Vision, Imaging and Computer Graphics) Part 1

Abstract. An automatic system for facial expression recognition should be able to recognize multiple facial expressions (i.e. "emotional segments") on-line and without interruption. The current paper proposes a new method for the automatic segmentation of "emotional segments" and the dynamic recognition of the corresponding facial expressions in video sequences. First, a new spatial filtering method based on Log-Normal filters is introduced for the analysis of the whole face and the automatic segmentation of the "emotional segments". Second, a similar filtering-based method is applied to the automatic and precise segmentation of the transient facial features (such as nasal root wrinkles and nasolabial furrows) and the estimation of their orientation. Finally, a dynamic and progressive fusion of the permanent and transient facial feature deformations is made inside each "emotional segment" for the temporal recognition of the corresponding facial expression. When tested on the automatic detection of "emotional segments" in 96 sequences from the MMI and Hammal-Caplier facial expression databases, the proposed method achieved an accuracy of 89%. Tested on 1655 images, the automatic detection of transient features achieved a mean precision of 70%, with an error of 2.5° in the estimation of their orientation. Finally, compared to the original model for static facial expression classification, the introduction of transient features and temporal information increases the precision of facial expression classification by 12% and compares favorably with human observers' performances.


Keywords: Facial expressions, Multiscale spatial filtering, Log-Normal filters, Holistic processing, Feature-based processing, Classification, TBM.

Introduction

Significant efforts have been made during the past two decades to improve the automatic recognition of facial expressions in order to understand and appropriately respond to the user's intentions. Applied in everyday-life situations (e.g. pain monitoring), such a system must be sensitive to the temporal behavior of the human face and able to analyze consecutive facial expressions without interruption. Yet, few efforts have been made so far toward the dynamic recognition of multiple facial expressions in video sequences. Indeed, most of the past work on facial expression recognition focused on static classification or, at best, assumed that there is only one expression in the studied sequence. Recent studies have investigated the use of temporal information for the recognition of facial expressions [2]. For example, [3], [4], [5] introduced temporal information for the recognition of Action Unit (AU) activation over 4 temporal segments (i.e. neutral, onset, apex, offset) in a predefined number of frames, while [6] introduced the temporal correlation between different AUs for their recognition. However, from our point of view these systems bypass the problem of facial expression recognition (which requires an additional processing step after detecting the AUs), and they do not allow the explicit recognition of more than one facial expression in a video sequence. Compared to these models, [7], [8] and [9] introduced temporal information for facial expression recognition. However, the temporal information was mainly used to improve the systems' performances: none of the proposed methods explicitly takes into account the temporal dynamics of the facial features and their asynchronous deformation from the beginning to the end of the facial expressions. Moreover, all the proposed methods are either holistic (analysis of the whole texture of the face, [6], [9]) or feature-based (analysis of facial feature information such as eyes, eyebrows and mouth, [3], [4], [5]), or at best combine the permanent and transient facial features (i.e. wrinkles in a set of selected areas, [7]) for the automatic recognition of facial expressions. However, it has been established in psychology that holistic and feature-based processing are both engaged in facial expression recognition [10]. Compared to these methods, the current contribution proposes a new video-based method for facial expression recognition which exploits both holistic and feature-based processing. The holistic processing is employed for the automatic segmentation of consecutive "emotional segments" (i.e. sets of consecutive frames corresponding to a facial muscle activation compared to a neutral state) and consists in the estimation of the global energy of the face by multiscale spatial filtering with Log-Normal filters. The feature-based processing consists in the dynamic and progressive analysis of the permanent and transient facial feature behavior inside each emotional segment for the recognition of the corresponding facial expression. The dynamic and progressive fusion process allows dealing with asynchronous facial feature deformations. The permanent facial feature information is measured by a set of characteristic points around the eyes, the eyebrows and the mouth, based on the work of [11]. A new filtering-based method is proposed, however, for transient facial feature detection.
Compared to the commonly proposed Canny-based methods for wrinkle detection [12], [7], the proposed spatial filtering method provides a precise detection of the transient features and an estimation of their orientation in a single pass. The fusion of all the facial feature information is based on the Transferable Belief Model (TBM) [13]. The TBM has already proved its suitability for facial expression classification [1] and for the explicit modeling of the doubt between expressions in the case of blends, combinations or uncertainty between two or several facial expressions. Given the critical role of the temporal dynamics of facial features in facial expression recognition, a dynamic and progressive fusion of the permanent and transient facial feature information (dealing with their asynchronous behavior) is made inside each emotional segment, from its beginning to its end, based on the temporal modeling of the TBM.

Holistic and Feature Based Processing

Facial expression results from the contraction of groups of facial muscles. These contractions deform the permanent facial features (such as eyes, eyebrows and mouth) and the skin texture, leading to the appearance of transient features (such as nasolabial furrows and nasal root wrinkles) [14]. These deformations may be analyzed either separately (i.e. feature-based processing) or all together (i.e. holistic processing).

Holistic Face Processing for Emotional Segment Detection

Emotional segments correspond to all the frames between each pair of beginning and end of each facial expression. Each emotional segment (i.e. one facial expression) is characterized by a set of facial muscle activations. These activations induce local changes in the spatial frequencies and orientations of the face compared to the relaxation (i.e. neutral) state, and can be measured by the energy response of a bank of filters at different frequencies and orientations applied to the whole face (i.e. holistic processing).

Log-Normal Filtering. A holistic face processing technique based on Log-Normal filtering is used for the dynamic detection of the pairs of beginning and end of multiple emotional segments in video sequences (i.e. to segment the video sequence). To do so, the studied face is first automatically detected in the video stream using the method proposed by [15] and tracked in the remainder of the sequence [11]. To cope with the problem of illumination variation, a preprocessing stage based on a model of the human retina [16] is applied to each detected face (see Fig. 1.b). This processing enhances the contours and performs a local correction of the illumination variation. To discard the frame-border information and measure only the facial deformations, a circular Hamming window is applied to the filtered face (Fig. 1.b). The power spectrum of the obtained face area is then passed through a bank of Log-Normal filters (15 orientations and 2 central frequencies), leading to a collection of features measuring the amount of energy displayed by the face in different frequency bands and across all orientations (Fig. 1.c). The Log-Normal filters are chosen because they are easily tuned and separable in frequency and orientation [17], which makes them well suited for detecting features at different scales and orientations (see Section 2.2). They are defined as follows:

$$G_{i,j}(f,\theta)=\frac{A}{f}\,\exp\!\left(-\frac{\ln^{2}(f/f_i)}{2\,\sigma_f^{2}}\right)\exp\!\left(-\frac{(\theta-\theta_j)^{2}}{2\,\sigma_\theta^{2}}\right)\qquad(1)$$

where $G_{i,j}(f,\theta)$ is the transfer function of the filter; $f$ and $\theta$ are, respectively, the frequency and orientation components; $f_i$ is the central frequency; $\theta_j$ the central orientation; $\sigma_f$ the frequency bandwidth; $\sigma_\theta$ the orientation bandwidth; and $A$ a normalization factor. The factor $1/f$ in Equation 1 accounts for the decrease of energy as a function of frequency, which on average follows a $1/f^{\alpha}$ power law for faces. This factor ensures that the sampling of the spectral information of the face takes into account the specific distribution of energy of the studied face at different scales.
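To make the filtering concrete, the following Python sketch builds a bank of Log-Normal transfer functions in the Fourier domain following Equation 1. It is a minimal illustration, not the authors' implementation: the bandwidths `sigma_f` and `sigma_theta` and the central frequencies in the usage line are assumed values, since the paper only specifies 15 orientations and 2 central frequencies.

```python
import numpy as np

def log_normal_bank(size, central_freqs, n_orient=15,
                    sigma_f=0.55, sigma_theta=np.pi / 15):
    """Log-Normal transfer functions G_ij(f, theta) on a size x size grid."""
    # Normalized frequency plane expressed in polar coordinates.
    u = np.fft.fftfreq(size)
    fx, fy = np.meshgrid(u, u)
    f = np.hypot(fx, fy)
    f[0, 0] = 1e-9                         # avoid log(0) at the DC component
    theta = np.arctan2(fy, fx)

    bank = []
    for f_i in central_freqs:
        for j in range(n_orient):
            theta_j = j * np.pi / n_orient
            # Wrap the orientation difference to [-pi/2, pi/2): the filters
            # are symmetric under a 180-degree rotation of the spectrum.
            d_theta = np.mod(theta - theta_j + np.pi / 2, np.pi) - np.pi / 2
            radial = np.exp(-np.log(f / f_i) ** 2 / (2 * sigma_f ** 2))
            angular = np.exp(-d_theta ** 2 / (2 * sigma_theta ** 2))
            bank.append(radial * angular / f)   # 1/f factor of Eq. 1, A = 1
    return bank

# Example with assumed central frequencies (low and high bands):
bank = log_normal_bank(128, central_freqs=[0.08, 0.25])
```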

Emotional Segment. Once the filtering is done, the facial muscle activity is measured by the energy of the obtained filters' responses. The results (Fig. 1.d) show high-energy responses (white areas) around the permanent facial features (such as eyes, eyebrows and mouth) and the transient facial features (such as nasolabial furrows and nasal root wrinkles). The amount of energy displayed by the face at high frequencies [22] and across all orientations is then summed and called the global energy, as follows:

$$E_t=\sum_{f}\sum_{\theta}\sum_{j=1}^{15} G_{i,j}(f,\theta)\,P_t(f,\theta)\qquad(2)$$

where $E_t$ is the global energy of the face and $P_t(f,\theta)$ the Fourier power spectrum of the current frame (expressed in polar coordinates); the sum runs over the high-frequency filters across all 15 orientations.
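As an illustration of Equation 2, the sketch below computes the global energy of one frame: the windowed face is taken to the Fourier domain and its power spectrum is summed through the high-frequency filters across all orientations. A separable Hamming window stands in for the circular window described above, and a square grayscale face crop matching the filter grid is assumed.

```python
import numpy as np

def global_energy(face, high_freq_filters):
    """Global energy E_t of a square grayscale face crop (Eq. 2)."""
    n = face.shape[0]
    # Separable approximation of the circular Hamming window used to
    # suppress frame-border information before the Fourier transform.
    w = np.hamming(n)
    power = np.abs(np.fft.fft2(face * np.outer(w, w))) ** 2   # P_t(f, theta)
    # Sum the filtered power over all frequencies and orientations.
    return sum(float(np.sum(g * power)) for g in high_freq_filters)
```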


Fig. 1. (a) Input image, (b) after retinal filtering and multiplication by a Hamming window, (c) bank of Log-Normal filters, (d) spatial response of the Log-Normal filters during three facial expressions

Fig. 2 shows examples of the temporal evolution of the global energy for different subjects and different facial expressions, going from neutral to the apex of the expression and coming back to neutral. These examples show that facial feature deformations effectively induce a change in the measured global energy. Similar evolutions are observed for all subjects, independently of individual morphological differences and facial expressions. The global energy is then used to detect each emotional segment as the set of frames between each pair of beginning and end: the beginning of each facial expression is characterized by an increase of the global energy of the face, and the end by the return of this energy to its value at the beginning, taken as a reference value.

Beginning. The detection of the beginning ($F_b$) of each emotional segment is based on the temporal derivative of the global energy signal $E_t$. Characterized by a quick change of the global energy, the beginning corresponds to a peak of this derivative compared to a relaxation state. The average $m_t$ of the derivative of the global energy and its standard deviation $sd_t$, from the beginning of the sequence (or from the end of the previous segment) until the current frame, are computed progressively inside a growing temporal window. The beginning $F_b$ corresponds to the first frame verifying:

$$E'_t > m_t + k\cdot sd_t\qquad(3)$$

where $E'_t$ is the derivative of the global energy at the current frame and $k$ a fixed threshold factor.

The use of a temporal window prevents local errors due to isolated peaks in the derivative of the global energy (e.g. peaks due to eye blinks).
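The beginning detector can be sketched as follows: the running mean $m_t$ and standard deviation $sd_t$ of the energy derivative are updated over a progressively growing window, and $F_b$ is the first frame whose derivative exceeds $m_t + k\cdot sd_t$. The threshold factor `k` and the minimal history length are assumptions, as the paper does not state their values.

```python
import numpy as np

def detect_beginning(energy, start=0, k=3.0, min_history=5):
    """First frame F_b where the energy derivative peaks above the
    running statistics of the relaxation state (Eq. 3)."""
    dE = np.diff(np.asarray(energy, dtype=float))   # derivative E'_t
    for t in range(start + min_history, len(dE)):
        history = dE[start:t]                       # progressive window
        m_t, sd_t = history.mean(), history.std()
        if dE[t] > m_t + k * sd_t:
            return t + 1                            # frame index of F_b
    return None                                     # no beginning found
```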


Fig. 2. Time course of the global energy (normalized in amplitude and length) for 3 facial expressions and for 9 subjects from the Hammal-Caplier database. Red curves correspond to the mean curve of all the subjects.

End. The detection of the end ($F_e$) of each emotional segment is based on the evolution of the global energy and is carried out after each beginning frame. The detection process starts 12 frames after the beginning of the segment (i.e. the minimum time necessary for a complete muscle activity, contraction and relaxation [19]). A temporal sliding window of 6 frames (the time for a muscle contraction) is then used to measure the local average of the global energy signal. The end of the current segment corresponds to the first frame where the mean of the global energy measured in the sliding window is close enough to the energy of the relaxation state before the beginning of the current segment:

$$\left|\overline{E}_t - mG_t\right| \leq k\cdot sdG_t\qquad(4)$$

where $\overline{E}_t$ is the mean of the global energy over the 6-frame sliding window and $k$ a fixed tolerance factor.

Here $mG_t$ and $sdG_t$ correspond to the mean and standard deviation of the global energy in the relaxation state (i.e. before the beginning of the current emotional segment).
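A matching sketch of the end detector: starting 12 frames after $F_b$, a 6-frame sliding window is moved along the global energy, and $F_e$ is the first frame where the local mean returns close to the relaxation-state statistics $mG_t$ and $sdG_t$ (Eq. 4). The tolerance factor `k` is again an assumed value.

```python
import numpy as np

def detect_end(energy, f_b, mG, sdG, k=1.0, delay=12, win=6):
    """First frame F_e where the local mean energy returns to the
    relaxation level measured before the beginning (Eq. 4)."""
    for t in range(f_b + delay, len(energy) - win + 1):
        local_mean = np.mean(energy[t:t + win])     # 6-frame sliding window
        if abs(local_mean - mG) <= k * sdG:
            return t                                # frame index of F_e
    return None                                     # segment still open
```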

At the end of each emotional segment, all the characteristic parameters are re-initialized to the current energy values. This re-initialization allows the progressive adaptation of the reference value of the global energy to the relaxation states between consecutive emotional segments (see Fig. 4). It is important to notice that the proposed method detects each pair ($F_b$, $F_e$) on-line, without any post-processing step, which makes it independent of the absolute level of global energy, which can depend on the expression intensity or the face morphology. Fig. 3 shows examples of the detection of emotional segments from the Hammal-Caplier and MMI databases (the MMI Facial Expression Database collected by M. Pantic and her group (www.mmifacedb.com), [19]). The automatic segmentation appears very comparable to a manual segmentation and robust to the variable duration of the expressive segments.

The detection of the beginning and the end of each emotional segment (i.e. each facial expression) can thus be applied several times during a multi-expression sequence, as sketched below.
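Putting the two detectors together gives the on-line segmentation loop: after each detected end, the relaxation-state statistics are re-initialized to the current energy values before the search for the next beginning resumes. This illustrative driver reuses the two sketches above and inherits their assumed parameters.

```python
import numpy as np

def segment_sequence(energy):
    """Detect all (F_b, F_e) pairs in a global-energy signal, on-line."""
    segments, cursor = [], 0
    while cursor < len(energy) - 1:
        f_b = detect_beginning(energy, start=cursor)
        if f_b is None:
            break
        relax = np.asarray(energy[cursor:f_b], dtype=float)  # relaxation state
        f_e = detect_end(energy, f_b, relax.mean(), relax.std())
        if f_e is None:
            break
        segments.append((f_b, f_e))       # one emotional segment
        cursor = f_e                      # re-initialize the reference state
    return segments
```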


Fig. 3. Examples of the detection of the beginning and the end on the Hammal-Caplier (subjects 1 and 2) and MMI (subjects 3 and 4) databases. Dashed lines correspond to the automatic results and solid lines to the expert manual segmentation.

Fig. 4 shows the evolution of the global energy during a sequence where the subject sequentially expressed 6 different facial expressions, together with the results of the segmentation into the different emotional segments. Each beginning is detected using Equation 3, and the process starts either at the first frame of the sequence or at the frame immediately following the last detected end (using Equation 4).


Fig. 4. Example of the automatic segmentation of a sequence containing 6 expressive segments. Dashed lines correspond to each detected pair of beginning and end. Note that the global energy between two segments varies during the sequence.

The obtained results show that the proposed method successfully detects the different emotional segments. To the best of our knowledge, this is the first time that several facial expression segments are automatically detected in a video sequence. After the segmentation process, each expressive segment is automatically and independently analyzed to recognize the corresponding facial expression based on feature-based processing.
