On the global scale, the motion of the face is indicative of pose changes and non-verbal communication signals, e.g., head movements during nodding or selective shifts of attention through pose changes. On the local scale, in contrast, internal facial motions are indicative of fine-grained changes in facial expression and emotional display, e.g., eye blinks, smiles, or mouth openings. We reasoned that segregating this information should further improve the analysis of emotion data and therefore process the visual input stream along three independent pathways. In order to make use of more detailed task-related information, we propose here an extended model architecture that first segregates form and motion, as briefly outlined above, and further subdivides the motion stream into separate representations of global and local motion. An overview of the architecture is presented in Figure 7.
Motion and form features are processed along two separate pathways, composed of alternating layers of filtering (S) and non-linear pooling (C) stages. In layer S1, representations of the input image at different scales are convolved with 2D Gabor filters of different orientations (form path), and a spatio-temporal correlation detector is used to build a discrete velocity-space representation (motion path). The initial motion representation is then further subdivided to build separate representations of global and local facial motion.
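As an illustration, the following Python sketch shows one way such an S1 form-path stage could be implemented: the input image is represented at several scales and convolved with a small bank of 2D Gabor filters at different orientations. Filter parameters, pyramid depth, and the use of scipy are assumptions made for illustration, not the settings of the original model.

import numpy as np
from scipy.ndimage import zoom
from scipy.signal import fftconvolve

def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5):
    # Single 2D Gabor filter (cosine phase); parameters are illustrative.
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)        # rotate coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + gamma**2 * yr**2) / (2 * sigma**2))
    g = envelope * np.cos(2 * np.pi * xr / wavelength)
    return g - g.mean()                               # zero-mean filter

def s1_form_layer(image, n_scales=4, n_orientations=4):
    # S1 responses indexed by [scale][orientation].
    responses = []
    for s in range(n_scales):
        scaled = zoom(image, 0.5**s)                  # pyramid level
        per_orientation = []
        for o in range(n_orientations):
            theta = o * np.pi / n_orientations
            k = gabor_kernel(11, wavelength=6.0, theta=theta, sigma=3.0)
            per_orientation.append(np.abs(fftconvolve(scaled, k, mode='same')))
        responses.append(per_orientation)
    return responses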
Global motion is approximated by the best-fit affine motion. To achieve this, the facial region is detected by searching for horizontally oriented, barcode-like structures within a Gabor-filtered input image (Dakin and Watt, 2009); this detection is refined into facial regions of interest around the eyes, nose, and mouth. These regions are excluded from the subsequent random sampling process used to estimate the affine transformation parameters representing the global flow (affine flow). The residual, or local, flow is then calculated by subtracting the affine flow from the unmodified flow and provides the input representation for extracting local motion responses.
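The following sketch illustrates one way to realize this separation, assuming a dense flow field is available as an (H, W, 2) array and the detected regions of interest as a boolean mask: a six-parameter affine model u = a1 + a2*x + a3*y, v = a4 + a5*x + a6*y is fitted to flow vectors sampled outside the mask, and the local flow is the measured flow minus the affine prediction. The plain least-squares fit on random samples stands in for whatever robust sampling scheme the model actually employs.

import numpy as np

def fit_affine_flow(flow, roi_mask, n_samples=500, seed=None):
    # flow: (H, W, 2) array of (u, v); roi_mask: True inside the face ROIs,
    # which are excluded from the estimation of the global (affine) motion.
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(~roi_mask)
    idx = rng.choice(len(xs), size=min(n_samples, len(xs)), replace=False)
    x, y = xs[idx].astype(float), ys[idx].astype(float)
    A = np.stack([np.ones_like(x), x, y], axis=1)
    pu, *_ = np.linalg.lstsq(A, flow[ys[idx], xs[idx], 0], rcond=None)  # a1..a3
    pv, *_ = np.linalg.lstsq(A, flow[ys[idx], xs[idx], 1], rcond=None)  # a4..a6
    return pu, pv

def residual_flow(flow, pu, pv):
    # Local flow = measured flow minus the predicted global affine flow.
    h, w = flow.shape[:2]
    y, x = np.mgrid[0:h, 0:w]
    affine = np.stack([pu[0] + pu[1] * x + pu[2] * y,
                       pv[0] + pv[1] * x + pv[2] * y], axis=-1)
    return flow - affine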
All three streams, or channels, are then processed further in parallel by hierarchical stages of alternating S- and C-filtering steps. Layer C1 cells pool the activities of S1 cells of the same orientation (direction) over a small local neighborhood and over two neighboring scales and speeds, respectively.
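A minimal sketch of such a C1 pooling step, reusing the per-scale, per-orientation S1 layout from the snippet above, could look as follows; the pooling neighborhood and subsampling stride are illustrative choices.

import numpy as np
from scipy.ndimage import maximum_filter, zoom

def c1_layer(s1, pool_size=8):
    # s1: list over scales of lists over orientations of 2D response maps.
    c1 = []
    for s in range(len(s1) - 1):                      # pool neighboring scales
        per_orientation = []
        for o in range(len(s1[s])):
            a = maximum_filter(s1[s][o], size=pool_size)    # local max pooling
            b = maximum_filter(s1[s + 1][o], size=pool_size)
            b = zoom(b, np.array(a.shape) / np.array(b.shape))  # align sizes
            per_orientation.append(np.maximum(a, b)[::pool_size, ::pool_size])
        c1.append(per_orientation)
    return c1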
Layer S2 is created by a simple template matching of patches of C1 activities against a number of prototype patches; these prototypes are randomly selected during the learning stage (for details, see Mutch and Lowe, 2008). In the final layer C2, the S2 prototype responses are again pooled over a limited neighborhood and combined into a single feature vector, which serves as input to the subsequent classification stage.
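The sketch below illustrates the S2/C2 idea under the assumption that the C1 output has been flattened into a list of 2D maps: prototype patches are cut out at random positions during learning, S2 responses are Gaussian radial-basis matches of C1 patches against these prototypes, and each prototype contributes the maximum of its responses (pooled globally here for simplicity) to the final feature vector.

import numpy as np

def sample_prototypes(c1_maps, n_prototypes=50, patch=4, seed=None):
    # Learning stage: randomly select prototype patches from C1 activities.
    rng = np.random.default_rng(seed)
    protos = []
    for _ in range(n_prototypes):
        m = c1_maps[rng.integers(len(c1_maps))]
        y = rng.integers(m.shape[0] - patch)
        x = rng.integers(m.shape[1] - patch)
        protos.append(m[y:y + patch, x:x + patch].copy())
    return protos

def s2_c2_features(c1_maps, protos, sigma=1.0):
    # S2: radial-basis match of every C1 patch against each prototype;
    # C2: keep the maximum response per prototype as one feature entry.
    feats = []
    for p in protos:
        ph, pw = p.shape
        best = -np.inf
        for m in c1_maps:
            for y in range(m.shape[0] - ph + 1):
                for x in range(m.shape[1] - pw + 1):
                    d = np.sum((m[y:y + ph, x:x + pw] - p) ** 2)
                    best = max(best, np.exp(-d / (2 * sigma**2)))
        feats.append(best)
    return np.array(feats)        # feature vector for the classifier stage

The resulting vector plays the role of the C2 output described above and would be handed to the classification stage.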