features are concatenated into a 10-dimensional early fusion feature
vector (a short concatenation sketch is given after the list below).
• The Mel frequency cepstral coefficient (MFCC) representation is
inspired by known perceptual properties of the human auditory
system. Perception is modeled using a filter bank whose filters are
spaced linearly at lower frequencies and logarithmically at higher
frequencies, in order to capture the phonetically important
characteristics of speech. The MFCCs are extracted as described in
Rabiner and Juang (1993); a minimal code sketch follows this list.
• The perceptual linear predictive (PLP) analysis is based on two
perceptually and biologically motivated concepts, namely the
critical bands and the equal loudness curves. In line with human
perception, frequencies below 1 kHz require higher sound pressure
levels than the reference, whereas sounds between 2 and 5 kHz
require less pressure. The critical band filtering is analogous to
the MFCC triangular filtering, except that the filters are equally
spaced on the Bark scale (not the Mel scale) and their shape is
trapezoidal rather than triangular. After the critical band analysis
and equal loudness conversion, the subsequent steps required for the
relative spectral (RASTA) processing extension follow the
implementation recommendations in Zheng et al. (2001). After
transforming the spectrum to the logarithmic domain and applying
RASTA filtering, the signal is transformed back using the
exponential function; a sketch of this filtering step is also given
after this list.
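To make the MFCC step concrete, the following minimal sketch computes MFCCs with the librosa library. It is only an illustration under assumed parameters (16 kHz sampling rate, 13 coefficients, a hypothetical file name) and is not the Rabiner and Juang (1993) implementation used in this work.

    import librosa

    # Load an utterance (file name is hypothetical) and resample to 16 kHz.
    signal, sr = librosa.load("utterance.wav", sr=16000)

    # Mel filter bank: linearly spaced at low frequencies, logarithmically
    # at high frequencies; 13 coefficients are an assumed, common choice.
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
    print(mfcc.shape)  # (13, number_of_frames)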
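The RASTA extension mentioned above amounts to band-pass filtering each log-spectral trajectory over time before returning to the linear domain. The sketch below uses the filter coefficients commonly quoted in the RASTA literature; it is a generic illustration and does not claim to reproduce the Zheng et al. (2001) recommendations followed here.

    import numpy as np
    from scipy.signal import lfilter

    def rasta_filter(log_spectrum):
        """Band-pass filter each critical-band trajectory over time.

        log_spectrum: array of shape (n_bands, n_frames), log domain.
        """
        # Commonly cited (causal) RASTA band-pass filter:
        # H(z) = 0.1 * (2 + z^-1 - z^-3 - 2 z^-4) / (1 - 0.98 z^-1)
        numerator = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
        denominator = np.array([1.0, -0.98])
        filtered = lfilter(numerator, denominator, log_spectrum, axis=1)
        # Transform back to the linear domain via the exponential function.
        return np.exp(filtered)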
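Finally, early fusion by concatenation, as referred to at the beginning of this section, can be illustrated as follows; the two feature vectors and their dimensionalities are hypothetical placeholders.

    import numpy as np

    # Hypothetical per-frame feature vectors from two descriptors
    # (dimensions are assumed for illustration only).
    features_a = np.zeros(4)
    features_b = np.zeros(6)

    # Early fusion: concatenate into a single 10-dimensional vector.
    fused = np.concatenate([features_a, features_b])
    assert fused.shape == (10,)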
3.2.2 Video Features
We investigated a biologically inspired model architecture to study
the performance of form and motion feature processing for emotion
classification from facial expressions. The model architecture builds
upon the functional segregation of form and motion processing in the
primate visual cortex. Initial processing is organized along two
largely independent pathways, one specialized for the processing of
form information and the other for motion information.
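To illustrate the idea of two segregated streams, the following sketch extracts simple form and motion descriptors from a pair of grayscale face frames, using a Gabor filter bank as a stand-in for form processing and dense optical flow as a stand-in for motion processing. The library (OpenCV), filter parameters, and pooling choices are assumptions made for this illustration and do not reproduce the model architecture used in this work.

    import cv2
    import numpy as np

    def form_features(gray_frame):
        # Form stream: mean responses of a small Gabor filter bank
        # (four orientations; kernel parameters are assumed).
        responses = []
        for theta in np.arange(0, np.pi, np.pi / 4):
            kernel = cv2.getGaborKernel((21, 21), 4.0, theta, 10.0, 0.5)
            responses.append(cv2.filter2D(gray_frame, cv2.CV_32F, kernel).mean())
        return np.array(responses)

    def motion_features(prev_gray, gray_frame):
        # Motion stream: dense optical flow between consecutive frames,
        # pooled into two simple statistics of the flow magnitude.
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray_frame, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        magnitude = np.linalg.norm(flow, axis=2)
        return np.array([magnitude.mean(), magnitude.max()])

    # Usage on two consecutive grayscale face frames:
    # prev_gray, gray = ...  # e.g. cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # features = np.concatenate([form_features(gray),
    #                            motion_features(prev_gray, gray)])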
We directly utilized the two independent data streams for the visual
analysis of facial features in Glodek et al. (2011) and already
achieved robust results for the automatic estimation of emotional user
states from video-only and audio-visual data. Here, we extended the
basic architecture by further subdividing the motion-processing
channel. We argue that different types of spatio-temporal information
are available in the motion representation, which can be utilized for
robust analysis of facial data. On the global scale, the overall, external,