In the literature, it is observed that 5-layer symmetric neural networks with three hidden
layers have been used for different speech tasks. The first and third hidden layers
have more nodes than the input and output layers, while the middle layer (also
known as the dimension compression layer) contains fewer units [23, 24]. In this type
of network, generally the first and third hidden layers are expected to capture the
local information among the feature vectors, and the middle hidden layer is meant for
capturing global information. Most of the existing studies [23-26] have used 5-layer
AANNs with the structure N1L N2N N3N N2N N1L for their optimal performance,
where L denotes linear units and N denotes nonlinear units. Here N1, N2, and N3
indicate the number of units in the first, second, and third layers, respectively, of the
symmetric 5-layer AANN. Usually N2 and N3 are derived experimentally, for achieving
the best performance in the given task. From the existing studies, it is observed that
N2 is in the range of 1.3-2 times N1, and N3 is in the range of 0.2-0.6 times N1.
For designing the structure of the network, we have used the guidelines from the
existing studies and experimented with a few structures to finalize the optimal one.
The performance of the network does not depend critically on its exact structure
[21, 27-29]. The number of units in the two hidden layers is guided by the heuristic
arguments given above.
All the input and output features are normalized to the range [−1, +1] before presenting
to the neural network. The back-propagation learning algorithm is used for adjusting
the weights of the network to minimize the mean squared error [24].
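To make the structure concrete, the following is a minimal sketch of such a symmetric 5-layer AANN, written in PyTorch. The framework choice is ours, and the layer sizes N1 = 13, N2 = 26, and N3 = 6 are illustrative values chosen within the quoted ranges (N2 ≈ 2 × N1, N3 ≈ 0.46 × N1), not the final structure used in this work.

```python
# A sketch of the symmetric 5-layer AANN with structure N1L N2N N3N N2N N1L.
# "L" layers are linear and "N" layers use nonlinear (tanh) activations.
# Layer sizes below are assumed for illustration only.
import torch
import torch.nn as nn

N1, N2, N3 = 13, 26, 6  # assumed: N2 within 1.3-2 x N1, N3 within 0.2-0.6 x N1

model = nn.Sequential(
    nn.Linear(N1, N2), nn.Tanh(),  # first hidden layer  (N2 nonlinear units)
    nn.Linear(N2, N3), nn.Tanh(),  # compression layer   (N3 nonlinear units)
    nn.Linear(N3, N2), nn.Tanh(),  # third hidden layer  (N2 nonlinear units)
    nn.Linear(N2, N1),             # linear output layer (N1 units)
)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
mse = nn.MSELoss()

def train_step(x: torch.Tensor) -> float:
    """One back-propagation step on a batch of feature vectors.

    x is expected to be already normalized to the range [-1, +1].
    """
    optimizer.zero_grad()
    reconstruction = model(x)
    loss = mse(reconstruction, x)  # autoassociative: the target is the input
    loss.backward()                # back-propagation of the squared error
    optimizer.step()
    return loss.item()
```

Because the network is trained to reconstruct its own input through the narrow compression layer, the mean squared reconstruction error serves directly as the training criterion. For example, `train_step(torch.rand(32, N1) * 2 - 1)` runs one update on a random batch already scaled to [−1, +1].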
2.5 Results and Discussion
In this work, 21 emotion recognition systems (ERS) are developed to study speech
emotion recognition using different spectral features. In the beginning, emotion
recognition systems are developed individually, using MFCCs, LPCCs, and formant
features. Formant features alone have not given appreciably good emotion recognition
performance; therefore, in the later stages, they are used in combination with
the other features. In total, 5 sets of emotion recognition systems are developed, as
shown in Fig. 2.9. They are the ERSs developed using the spectral features derived
from (a) the entire speech signal, (b) the vowel region, (c) the consonant region,
(d) the CV transition region, and (e) pitch synchronous analysis. In each set, emotion
recognition systems are developed using LPCCs, MFCCs, LPCCs + formant features,
and MFCCs + formant features. In the following paragraphs, the emotion recognition
performance of all the individual emotion recognition systems, developed using Set3 of
IITKGP-SESC, is discussed. Out of the 10 speakers' speech data, the utterances of 8
speakers (4 male and 4 female) are used for training the ER models and the utter-
ances of 2 (a male and a female) speakers are used for validating the trained models.
Thirteen spectral features are extracted from a frame of 20 ms, with a shift of 5 ms.
GMMs with 64 components are used to develop the ERSs. The results of emotion
recognition using the session- and text-independent (Set1 and Set2) speech data
are also given at the end of the chapter.
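As a rough illustration of this experimental setup, the sketch below extracts 13 MFCCs per 20 ms frame with a 5 ms shift and trains one 64-component GMM per emotion. The library choices (librosa, scikit-learn), the 16 kHz sampling rate, the diagonal covariance type, and all function and variable names are assumptions made for this sketch; the text does not specify the authors' tools.

```python
# A minimal sketch of the GMM-based emotion recognition setup described
# above: 13 spectral features (MFCCs here) per 20 ms frame with a 5 ms
# shift, and one 64-component GMM per emotion class.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

SR = 16000                 # assumed sampling rate
FRAME = int(0.020 * SR)    # 20 ms frame
SHIFT = int(0.005 * SR)    # 5 ms shift

def extract_mfcc(wav_path: str) -> np.ndarray:
    """Return a (num_frames, 13) matrix of MFCCs for one utterance."""
    y, _ = librosa.load(wav_path, sr=SR)
    mfcc = librosa.feature.mfcc(y=y, sr=SR, n_mfcc=13, n_mels=26,
                                n_fft=FRAME, hop_length=SHIFT)
    return mfcc.T

def train_emotion_models(train_files_by_emotion: dict) -> dict:
    """train_files_by_emotion maps an emotion label to a list of wav paths."""
    models = {}
    for emotion, files in train_files_by_emotion.items():
        feats = np.vstack([extract_mfcc(f) for f in files])
        # diagonal covariance is an assumption; it is common for speech GMMs
        gmm = GaussianMixture(n_components=64, covariance_type="diag")
        models[emotion] = gmm.fit(feats)
    return models

def classify(models: dict, wav_path: str) -> str:
    """Pick the emotion whose GMM gives the highest average log-likelihood."""
    feats = extract_mfcc(wav_path)
    return max(models, key=lambda e: models[e].score(feats))
```

Training on the utterances of 8 speakers and calling `classify` on the held-out 2 speakers' utterances would reproduce the speaker-independent evaluation protocol described above.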