Digital Signal Processing Reference
Fig. 3.1 (a) Duration patterns for the sequence of syllables (duration vs. syllable position in the sentence), (b) energy contours, and (c) pitch contours (vs. frame number) in different emotions (anger, sad, fear, neutral, happy) for the utterance "mAtA aur pitA kA Adar karnA chAhie"
emotions, even though they have similar average values. Thus, Fig. 3.1 provides the
basic motivation to explore the dynamics of prosodic features for discriminating the
emotions.
By observing the pitch contours in Fig. 3.1c, it may be noted that, for the given data, the initial portions of the plots (the first 20 pitch values or so) do not carry uniformly discriminative information. The static features are almost the same for happiness and neutral. However, static features may be used to distinguish the anger, sadness, and fear emotions, as their static pitch values are spread widely between 250 and 300 Hz. Similarly, the dynamic features are almost the same for all emotions except fear: the pitch contours for anger, happiness, neutral, and sadness show an initial decreasing trend followed by a gradual rise, whereas the fear pitch contour starts with a rising trend. Similar local discriminative properties may also be observed
in the energy and duration profiles over the initial, middle, and final parts of the utterances. This indicates that it may sometimes be difficult to classify the emotions based on either global or local prosodic trends derived from the entire utterance alone. Therefore, in this work, we explore static (global) and dynamic (local) prosodic features, along with their combination, for speech emotion recognition at different levels (utterance, word, and syllable) and positions (initial, middle, and final).
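The distinction drawn above, between static (global) statistics computed over a whole contour and dynamic (local) features computed frame-to-frame or over initial, middle, and final regions, can be sketched as follows. This is a minimal illustration only: the function names and the equal three-way split are assumptions for exposition, not the exact feature set used in this work.

```python
from statistics import mean, stdev

def global_features(contour):
    # Static (global) prosodic statistics over the entire contour,
    # e.g. a per-frame pitch or energy sequence.
    return {"mean": mean(contour), "std": stdev(contour),
            "min": min(contour), "max": max(contour)}

def local_features(contour, n_parts=3):
    # Dynamic (local) view: frame-to-frame deltas capture the trend
    # (rising vs. falling), and the same static statistics are computed
    # separately over the initial, middle, and final regions (assumed
    # here to be equal thirds of the contour).
    deltas = [b - a for a, b in zip(contour, contour[1:])]
    k = len(contour) // n_parts
    regions = [contour[i * k:(i + 1) * k] for i in range(n_parts - 1)]
    regions.append(contour[(n_parts - 1) * k:])  # final region takes remainder
    region_stats = [global_features(r) for r in regions]
    return deltas, region_stats
```

For example, a contour whose deltas are negative early and positive later would match the initial-decreasing, gradually-rising pattern described for anger, happiness, neutral, and sadness, while uniformly positive initial deltas would match the fear pattern.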
3.4 Extraction of Global and Local Prosodic Features
In this chapter, emotion recognition (ER) systems are developed using local and global prosodic features extracted at the sentence, word, and syllable levels. Word and syllable boundaries are identified using vowel onset points (VOPs) as the anchor
 