Table 4.1 Percentage of mis-classification between anger and happiness emotions, as reported by different research works in the literature using prosodic features

Reference                          Language                            Anger classified    Happiness classified
                                                                       as happiness (%)    as anger (%)
Serdar Yildirim et al. [2]         English                             42                  31
Dimitrios Ververidis et al. [3]    Scandinavian (Danish)               20                  14
Felix Burkhardt et al. [4]         German (Berlin Emotion Database)    -                   12
Pierre-Yves Oudeyer et al. [5]     Concatenated synthesis (English)    35                  30
S. G. Koolagudi et al. [6]         Telugu                              -                   34
S. G. Koolagudi                    German (Berlin)                     27                  20
Raquel Tato [7]                    German                              24                  25
The type of mis-classification shown in Table 4.1 can be reduced to some extent by using spectral features. Hence, it is hypothesized that combining different features may improve emotion recognition performance and make the systems more robust.
In this work, speech features extracted from the excitation source, the vocal tract (VT) system, and prosodic aspects are combined for emotion recognition. Among the various excitation source, spectral, and prosodic features proposed in the previous chapters, the best performing ones are chosen for combination in this chapter. They are (a) LP residual samples chosen around glottal closure instants, (b) twenty-one LPCCs extracted from the entire speech utterance using a block processing approach, and (c) local prosodic features capturing the duration, pitch, and energy profiles of the speech utterance. While combining the features, score-level fusion is preferred because the features are extracted using different mechanisms: spectral features are extracted at the frame level, excitation source features at the epoch level, and prosodic features at the utterance level. Hence, feature-level fusion is not suitable for combining features derived using such heterogeneous approaches.
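As a concrete illustration of score-level fusion, the sketch below combines per-emotion scores from the three classifiers with fixed weights. It is a minimal example rather than the exact implementation of this work: the score values, emotion set, and weights are illustrative placeholders, and the scores are assumed to be already normalized to a common range.

```python
import numpy as np

# Illustrative per-emotion scores (anger, happiness, neutral, sadness),
# assumed already normalized to [0, 1]; the values are placeholders.
source_scores   = np.array([0.60, 0.25, 0.10, 0.05])  # AANN, excitation source
spectral_scores = np.array([0.40, 0.35, 0.15, 0.10])  # GMM, LPCC features
prosodic_scores = np.array([0.30, 0.45, 0.15, 0.10])  # SVM, prosodic features

# Weighted score-level fusion: a convex combination of the three score
# vectors; the weights here are arbitrary and would be tuned empirically.
weights = (0.4, 0.4, 0.2)
fused = (weights[0] * source_scores
         + weights[1] * spectral_scores
         + weights[2] * prosodic_scores)

emotions = ["anger", "happiness", "neutral", "sadness"]
print(emotions[int(np.argmax(fused))])  # emotion with the highest fused score
```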
In this work, AANNs are used to capture the emotion-specific information from excitation source features, GMMs are used to develop the models based on spectral features, and SVMs are used to discriminate the emotions using prosodic features. Since the measures are derived from different models and features, they need to be normalized appropriately before they are combined. The weighted combination of scores for (a) excitation source and vocal tract system features, (b) source and prosodic features, (c) system and prosodic features, and (d) source, system, and prosodic features is studied, and the results are discussed in the following sections.
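Because an AANN yields reconstruction errors, a GMM yields log-likelihoods, and an SVM yields margins, the raw measures live on very different scales. The sketch below shows one common way to make them comparable before the weighted combination: min-max normalization of each model's per-emotion scores. The normalization scheme and all numeric values are assumptions for illustration, not the exact procedure of this work.

```python
import numpy as np

def minmax(scores):
    """Rescale a per-emotion score vector to [0, 1] (0.5 if constant)."""
    lo, hi = scores.min(), scores.max()
    return np.full_like(scores, 0.5) if hi == lo else (scores - lo) / (hi - lo)

# Raw measures from the three models (placeholder values, one per emotion):
aann_error = np.array([0.42, 0.65, 0.80, 0.71])        # lower error = better match
gmm_loglik = np.array([-310.2, -295.7, -330.5, -341.0])
svm_margin = np.array([0.8, 1.4, -0.2, -0.9])

# Orient every measure so that larger means "more likely", then normalize.
confidences = [minmax(-aann_error), minmax(gmm_loglik), minmax(svm_margin)]

# Case (d): weighted combination of source, system, and prosodic scores.
weights = np.array([0.3, 0.4, 0.3])                    # illustrative weights
fused = sum(w * c for w, c in zip(weights, confidences))
predicted = int(np.argmax(fused))
```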
 