study, Set3 of IITKGP-SESC is used as the dataset, where 8 speakers' speech data
is used for training the emotion recognition models and 2 speakers' speech data is
used for validating them. Chapter 5 discusses emotion recognition using speaking rate features. In this chapter, emotion recognition is carried out in two stages.
Set3 of IITKGP-SESC is used to evaluate the performance of two-stage emotion
recognition systems developed based on a speaking rate features. A linear combi-
nation of the measures from different features is used while combining the features.
Chapter 6 explores the combination of speech features for recognizing real-life emotions. Five emotions collected from Hindi commercial and art movies are used
to represent real-life emotions. Multi-speaker emotional speech data of around 12 min, collected from Hindi movies, is used for building each of the emotion models. Test utterances of duration 2-3 s, derived from the remaining 3 min of data, are used
for evaluating the trained models.
4.2 Feature Combination: A Study
From the production perspective, speech is a convolved outcome of the excitation source and the vocal tract system response. Prosodic features are extracted from longer speech segments to represent perceptual qualities of speech such as melody, timbre, and rhythm. Ideally, the three speech features used in this work (excitation source, vocal tract system, and prosodic) for emotion recognition represent three different aspects of speech production and perception. Therefore, they are believed to contain non-overlapping, complementary emotion-related information. In this chapter, we intend to exploit the complementary nature of these features by combining their measures to improve emotion recognition performance [1].
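The weighted linear combination of measures described above can be sketched as follows. The per-class confidence scores, class labels, and weights below are purely illustrative (not values from the study); in practice the weights would be tuned on held-out data.

```python
# Hypothetical per-class confidence scores from three independent
# emotion models (excitation source, vocal tract system, prosodic)
# for the classes [anger, fear, happiness, neutral, sadness].
scores = {
    "source":  [0.30, 0.25, 0.20, 0.15, 0.10],
    "system":  [0.20, 0.35, 0.15, 0.20, 0.10],
    "prosody": [0.25, 0.20, 0.35, 0.10, 0.10],
}

# Assumed combination weights (one per feature stream, summing to 1).
weights = {"source": 0.3, "system": 0.3, "prosody": 0.4}

def combine(scores, weights):
    """Linear combination of the per-class measures from each feature."""
    n_classes = len(next(iter(scores.values())))
    return [
        sum(weights[k] * scores[k][i] for k in scores)
        for i in range(n_classes)
    ]

combined = combine(scores, weights)
predicted = combined.index(max(combined))  # index of the winning class
```

Because each stream carries partly complementary evidence, a class that is not the top choice of any single stream can still win after combination, which is the intended effect of the fusion.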
Scientifically, emotions are studied from different viewpoints. Psychologists have tried to map all emotions onto a three-dimensional space, known as the emotional space. Its dimensions are arousal (activation), pleasure (valence), and dominance (power). It is generally known that emotions such as anger, happiness, and fear have high arousal characteristics, whereas disgust and sarcasm have negative arousal characteristics. To discriminate the emotions within such a group, the other dimensions, such as valence, may be needed.
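The two-stage idea implied here, grouping emotions by arousal first and then separating them by valence, can be sketched as below. The coordinates are hypothetical placements on a -1 to 1 scale, chosen only to illustrate the mechanism, not measured values.

```python
# Illustrative (arousal, valence) coordinates for a few emotions.
# These numbers are assumptions for the sketch, not data from the study.
EMOTIONS = {
    "anger":     (0.8, -0.6),
    "fear":      (0.8, -0.5),
    "happiness": (0.7,  0.7),
    "sadness":   (-0.5, -0.6),
}

def arousal_group(emotion, threshold=0.0):
    """Stage 1: assign an emotion to the high- or low-arousal group."""
    arousal, _ = EMOTIONS[emotion]
    return "high" if arousal >= threshold else "low"

def valence_sign(emotion):
    """Stage 2: within a group, discriminate by the sign of valence."""
    _, valence = EMOTIONS[emotion]
    return "positive" if valence >= 0 else "negative"
```

With these placements, anger and happiness fall in the same high-arousal group and are indistinguishable at stage 1, but their opposite valence signs separate them at stage 2, which is exactly why a second dimension is needed.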
Generally, the arousal characteristic of emotions is represented by the prosodic parameter intensity. The intensity characteristics in turn influence the other prosodic parameters: higher arousal indicates higher energy, leading to higher pitch and shorter duration. For example, high-arousal emotions such as anger and fear have high energy, raised intonation, and shorter durations. These emotions cannot be discriminated using prosodic features alone. Along with prosodic features, features representing the other emotion dimensions, such as valence, are essential. Thus, mis-classification within a set of emotions sharing similar acoustic properties may be reduced. Table 4.1 highlights some results from the literature, where anger and happiness are mis-classified as each other when prosodic features are used. From the results presented in Chap. 2, it may be noted that
 