developed using word level prosodic features. Each block of Fig. 3.5 contains the
ERS shown in Fig. 3.3. From each portion of the sentence, global and local prosodic
features are computed and the ERSs are developed as shown in Fig. 3.3. Later, the
measures from the initial, middle and final words are combined to obtain the overall
measure.
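The per-word feature extraction described above can be sketched as follows. This is a minimal sketch under stated assumptions: the global features are taken to be simple statistics (mean, minimum, maximum, standard deviation) of the pitch and energy contours plus the word duration, and the local features are the contours resampled to a fixed length. The names `global_features`, `local_contour`, `pick_positions` and `N_LOCAL` are hypothetical, and the exact feature set of the ERSs in Fig. 3.3 is not reproduced here.

```python
import numpy as np

# Hypothetical length to which every local (dynamic) contour is resampled.
N_LOCAL = 25


def global_features(f0, energy, duration_s):
    """Global (static) features of one word: mean, min, max and standard
    deviation of the pitch and energy contours, followed by the duration."""
    feats = []
    for contour in (np.asarray(f0, dtype=float), np.asarray(energy, dtype=float)):
        feats.extend([contour.mean(), contour.min(), contour.max(), contour.std()])
    feats.append(float(duration_s))
    return np.array(feats)


def local_contour(contour, n_points=N_LOCAL):
    """Local (dynamic) features: the contour linearly resampled to n_points
    so that words of different lengths give equal-sized vectors.
    Local duration features (e.g. syllable durations) are not modelled here."""
    contour = np.asarray(contour, dtype=float)
    src = np.linspace(0.0, 1.0, num=len(contour))
    dst = np.linspace(0.0, 1.0, num=n_points)
    return np.interp(dst, src, contour)


def pick_positions(words):
    """Select the initial, middle and final words of an utterance."""
    return {"initial": words[0], "middle": words[len(words) // 2], "final": words[-1]}
```

For each of the three selected words, `global_features` would feed the global ERS and `local_contour` the pitch and energy ERSs; resampling to a fixed length is one simple way to make contours of different durations comparable.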
Table 3.10 shows the average emotion recognition performance of word level
global and local emotion recognition systems. Table 3.11 shows the overall emotion
recognition performance using word level prosodic features, obtained by score level
combination of local and global features of initial, middle, and final words.
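Tables 3.10 and 3.11 report average recognition performance. One common way to obtain such a figure, assumed here rather than taken from the text, is to compute the recognition rate of each emotion from a confusion matrix (rows: true emotion, columns: recognized emotion) and then average over emotions. The function names below are hypothetical.

```python
import numpy as np


def per_emotion_rates(confusion):
    """Recognition rate (%) of each emotion. Rows of `confusion` are the true
    emotions and columns the recognized ones, so the diagonal holds the
    correctly recognized test utterances."""
    confusion = np.asarray(confusion, dtype=float)
    return 100.0 * np.diag(confusion) / confusion.sum(axis=1)


def average_rate(confusion):
    """Average recognition performance: unweighted mean over emotions."""
    return float(per_emotion_rates(confusion).mean())


def confusion_from_labels(true_ids, predicted_ids, n_emotions):
    """Build a confusion matrix from integer emotion labels."""
    confusion = np.zeros((n_emotions, n_emotions), dtype=int)
    for t, p in zip(true_ids, predicted_ids):
        confusion[t, p] += 1
    return confusion
```

An unweighted mean gives each emotion equal influence on the average, regardless of how many test utterances it contributes.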
In Table 3.10, column Global indicates the results obtained using only global
prosodic features. Under local prosodic parameters, systems are developed individually
using the duration, pitch and energy components. Columns with headings Dur,
Pitch, and Energy show the results obtained using the dynamic nature of the duration,
pitch and energy parameters, respectively. Column Local indicates the results
obtained by score level combination of the duration, pitch, and energy parameters
with appropriate weighting factors. Column Glo. + Loc. indicates the emotion
recognition performance due to the combination of measures from the global (Glo.)
and local (Loc.) systems. All these results are reported for the initial, middle and
final words of the utterances of IITKGP-SESC.
From the results of Table 3.10, it is evident that not all parts of the utterances
contribute uniformly toward emotion recognition. Some of the important observations
are mentioned below. There is a drastic improvement in recognition performance
when local prosodic features are used instead of global features. However, the
improvement over the local features is marginal when global and local features are
combined. This indicates that global prosodic features at the word level may not be
complementary in nature to their local counterparts. Among the individual local
prosodic features, energy features are the most discriminative for the initial words
of the utterances. This is expected, as utterances generally have dominant energy
profiles at the beginning. In the case of middle words, energy and pitch parameters
have almost equal emotion discrimination, with recognition rates of 43 and 44%,
respectively. Pitch values are the most discriminative for emotion recognition using
final words. Duration information has always been the least discriminative for initial,
middle, and final words. In general, final words carry more emotion-discriminative
information (a recognition rate of about 64%) compared to their initial and middle
counterparts. It is observed that recognition performance using final words is almost
the same as the performance achieved using
the entire sentence. This indicates that only about one-third (1/3) of the sentence
(the final part) is sufficient to recognize the emotions. Interestingly, the average
performance obtained by combining the measures of the initial, middle, and final words
is almost equal to the recognition rate obtained using entire utterances (see Table 3.8).
Comparison of the recognition performance of individual emotions with respect to
initial, middle, and final words is given in Fig. 3.6. It may be seen from the figure
that passive emotions like disgust, sadness, neutral, and surprise are better
discriminated using the final words of the sentences. Initial words played an important role in