On the global scale, the motion of the face is indicative of pose changes and non-verbal communication signals, e.g., head movements during nodding or selective shifts of attention through pose changes. On the local scale, in contrast, internal facial motions are indicative of fine-grained changes in facial expression and emotional display, e.g., eye blinks, smiles, or mouth openings. We reasoned that segregating this information should further improve the analysis of emotion data and therefore process the visual input stream along three independent pathways. In order to make use of more detailed task-related information, we propose here an extended model architecture that first segregates form and motion, as briefly outlined above, and further subdivides the motion stream into separate representations of global and local motion. An overview of the architecture is presented in Figure 7.
Motion and form features are processed along two separate pathways, composed of alternating layers of filtering (S) and non-linear pooling (C) stages. In layer S1, representations of the input image at different scales are convolved with 2D Gabor filters of different orientations (form path), and a spatio-temporal correlation detector is used to build a discrete velocity-space representation (motion path). The initial motion representation is then further subdivided to build separate representations of global and local facial motion.
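As an illustration, the following Python sketch shows one way such an S1 form-path stage could be implemented: the input image is represented at several scales and convolved with a small bank of 2D Gabor filters at different orientations. Filter parameters, pyramid depth, and the use of scipy are assumptions made for illustration, not the settings of the original model.

import numpy as np
from scipy.ndimage import zoom
from scipy.signal import fftconvolve

def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5):
    # Single 2D Gabor filter (cosine phase); parameters are illustrative.
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)        # rotate coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + gamma**2 * yr**2) / (2 * sigma**2))
    g = envelope * np.cos(2 * np.pi * xr / wavelength)
    return g - g.mean()                               # zero-mean filter

def s1_form_layer(image, n_scales=4, n_orientations=4):
    # S1 responses indexed by [scale][orientation].
    responses = []
    for s in range(n_scales):
        scaled = zoom(image, 0.5**s)                  # pyramid level
        per_orientation = []
        for o in range(n_orientations):
            theta = o * np.pi / n_orientations
            k = gabor_kernel(11, wavelength=6.0, theta=theta, sigma=3.0)
            per_orientation.append(np.abs(fftconvolve(scaled, k, mode='same')))
        responses.append(per_orientation)
    return responses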
Global motion is approximated by the best-fit affine motion. To achieve this, the facial region is detected by searching for horizontally oriented, barcode-like structures within a Gabor-filtered input image (Dakin and Watt, 2009); this detection is refined into facial regions of interest around the eyes, nose, and mouth. These regions are excluded from the subsequent random sampling process used to estimate the affine transformation parameters representing the global flow (affine flow). The residual, or local, flow is then calculated by subtracting the affine flow from the unmodified flow and provides the input representation for extracting local motion responses.
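The following sketch illustrates one way to realize this separation, assuming a dense flow field is available as an (H, W, 2) array and the detected regions of interest as a boolean mask: a six-parameter affine model u = a1 + a2*x + a3*y, v = a4 + a5*x + a6*y is fitted to flow vectors sampled outside the mask, and the local flow is the measured flow minus the affine prediction. The plain least-squares fit on random samples stands in for whatever robust sampling scheme the model actually employs.

import numpy as np

def fit_affine_flow(flow, roi_mask, n_samples=500, seed=None):
    # flow: (H, W, 2) array of (u, v); roi_mask: True inside the face ROIs,
    # which are excluded from the estimation of the global (affine) motion.
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(~roi_mask)
    idx = rng.choice(len(xs), size=min(n_samples, len(xs)), replace=False)
    x, y = xs[idx].astype(float), ys[idx].astype(float)
    A = np.stack([np.ones_like(x), x, y], axis=1)
    pu, *_ = np.linalg.lstsq(A, flow[ys[idx], xs[idx], 0], rcond=None)  # a1..a3
    pv, *_ = np.linalg.lstsq(A, flow[ys[idx], xs[idx], 1], rcond=None)  # a4..a6
    return pu, pv

def residual_flow(flow, pu, pv):
    # Local flow = measured flow minus the predicted global affine flow.
    h, w = flow.shape[:2]
    y, x = np.mgrid[0:h, 0:w]
    affine = np.stack([pu[0] + pu[1] * x + pu[2] * y,
                       pv[0] + pv[1] * x + pv[2] * y], axis=-1)
    return flow - affine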
All three streams, or channels, are then processed further in parallel by hierarchical stages of alternating S- and C-filtering steps. Layer C1 cells pool the activities of S1 cells of the same orientation (direction) over a small local neighborhood and over two neighboring scales and speeds, respectively.
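A minimal sketch of such a C1 pooling step, reusing the per-scale, per-orientation S1 layout from the snippet above, could look as follows; the pooling neighborhood and subsampling stride are illustrative choices.

import numpy as np
from scipy.ndimage import maximum_filter, zoom

def c1_layer(s1, pool_size=8):
    # s1: list over scales of lists over orientations of 2D response maps.
    c1 = []
    for s in range(len(s1) - 1):                      # pool neighboring scales
        per_orientation = []
        for o in range(len(s1[s])):
            a = maximum_filter(s1[s][o], size=pool_size)    # local max pooling
            b = maximum_filter(s1[s + 1][o], size=pool_size)
            b = zoom(b, np.array(a.shape) / np.array(b.shape))  # align sizes
            per_orientation.append(np.maximum(a, b)[::pool_size, ::pool_size])
        c1.append(per_orientation)
    return c1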
Layer S2 is created by a simple template matching of patches of C1 activities against a number of prototype patches; these prototypes are randomly selected during the learning stage (for details, see Mutch and Lowe, 2008). In the final layer C2, the S2 prototype responses are again pooled over a limited neighborhood and combined into a single feature vector, which serves as input to the subsequent classification stage.
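The sketch below illustrates the S2/C2 idea under the assumption that the C1 output has been flattened into a list of 2D maps: prototype patches are cut out at random positions during learning, S2 responses are Gaussian radial-basis matches of C1 patches against these prototypes, and each prototype contributes the maximum of its responses (pooled globally here for simplicity) to the final feature vector.

import numpy as np

def sample_prototypes(c1_maps, n_prototypes=50, patch=4, seed=None):
    # Learning stage: randomly select prototype patches from C1 activities.
    rng = np.random.default_rng(seed)
    protos = []
    for _ in range(n_prototypes):
        m = c1_maps[rng.integers(len(c1_maps))]
        y = rng.integers(m.shape[0] - patch)
        x = rng.integers(m.shape[1] - patch)
        protos.append(m[y:y + patch, x:x + patch].copy())
    return protos

def s2_c2_features(c1_maps, protos, sigma=1.0):
    # S2: radial-basis match of every C1 patch against each prototype;
    # C2: keep the maximum response per prototype as one feature entry.
    feats = []
    for p in protos:
        ph, pw = p.shape
        best = -np.inf
        for m in c1_maps:
            for y in range(m.shape[0] - ph + 1):
                for x in range(m.shape[1] - pw + 1):
                    d = np.sum((m[y:y + ph, x:x + pw] - p) ** 2)
                    best = max(best, np.exp(-d / (2 * sigma**2)))
        feats.append(best)
    return np.array(feats)        # feature vector for the classifier stage

The resulting vector plays the role of the C2 output described above and would be handed to the classification stage.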