Table 2.1 Confusion matrix for human recognition performance for the NAW dataset (rows: presented emotion; columns: perceived emotion; values in %)

            Happy   Angry   Disgust   Surprised   Sad    Neutral
Happy        76.5     0.0      1.5        12.0     0.0     10.0
Angry         0.0    90.0      5.0         0.0     4.0      1.0
Disgust       2.0    32.5     34.5         6.5     3.0     21.5
Surprised     9.0     2.0      8.0        64.5     1.5     15.0
Sad           0.0     0.0      0.5         0.0    98.0      1.5
Neutral       1.0     0.0      2.5         0.0     0.0     96.5
identified based on speech semantics, the facial expression of the speaker, as well as a basic understanding of the situations in which the video clips occur. These video clips were converted to MP3 audio files at a sampling rate of 8.0 kHz, mono stream, and their amplitudes were scaled in the range (-1, +1) V. A number of findings using this dataset have been reported earlier in [4, 5, 13].
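As a rough illustration of this conversion step, the sketch below resamples a clip to 8.0 kHz mono and peak-normalizes its amplitude into the (-1, +1) range. The choice of librosa and soundfile as tools, the WAV output (rather than MP3), and the file names are assumptions for illustration; the original study does not specify its conversion pipeline.

```python
# Illustrative preprocessing sketch, not the authors' actual pipeline:
# resample a clip to 8.0 kHz mono and scale amplitudes into (-1, +1).
import numpy as np
import librosa
import soundfile as sf

TARGET_SR = 8000  # 8.0 kHz mono, matching the NAW conversion described above

def preprocess_clip(in_path, out_path):
    # librosa loads as floating-point mono and resamples in one step
    y, sr = librosa.load(in_path, sr=TARGET_SR, mono=True)
    # Peak-normalize so all samples fall within [-1, +1]
    peak = np.max(np.abs(y))
    if peak > 0:
        y = y / peak
    sf.write(out_path, y, TARGET_SR)  # written as WAV here for simplicity
    return y

# Example (hypothetical file names):
# preprocess_clip("clip_017.wav", "clip_017_8k.wav")
```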
2.2.3 Human Perception Test
In order to ensure that the video clips obtained for the NAW dataset were correctly perceived, a manual perception test was subsequently carried out. In this test, a total of 40 human subjects volunteered to provide their perceived assessment of the speech emotion audio files presented: 11 from Nanyang Technological University, Singapore (nine males and two females), and 29 from International Islamic University Malaysia (15 males and 14 females), with a mean age of 23 years.
The participating subjects reported that they were in a neutral emotional state prior to the commencement of the human perception test. The survey was conducted in a laboratory environment where the judges could listen to the speech emotion audio files with minimal distraction. They sat in front of a computer and listened to the speech emotion audio files through headphones, ensuring that they could hear the audio without interruption. For each speech emotion audio file, they indicated the perceived emotion using a six-alternative forced-choice format covering the emotion classes, including neutral, shown in Table 2.1.
To avoid biasing the judges' perception, each speech emotion audio file was labeled with a file number that has no relation to the respective emotion. In addition, the file numbering was randomized to prevent any prediction of the emotion pattern. The human judges were allowed to listen to any of the speech emotion audio files repeatedly before making a decision.
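A minimal sketch of how such forced-choice responses could be tallied into the row-normalized percentages of Table 2.1 is shown below; the (true_label, perceived_label) response format and all names here are illustrative assumptions, not the study's actual procedure.

```python
# Hypothetical tally of forced-choice responses into a row-normalized
# confusion matrix (percentages), as in Table 2.1.
import numpy as np

CLASSES = ["Happy", "Angry", "Disgust", "Surprised", "Sad", "Neutral"]
IDX = {c: i for i, c in enumerate(CLASSES)}

def confusion_percentages(responses):
    """responses: iterable of (true_label, perceived_label) pairs,
    one pair per judge per audio file."""
    counts = np.zeros((len(CLASSES), len(CLASSES)))
    for true, perceived in responses:
        counts[IDX[true], IDX[perceived]] += 1
    # Normalize each row so it sums to 100% (share of responses per emotion)
    row_sums = counts.sum(axis=1, keepdims=True)
    return 100.0 * counts / np.where(row_sums == 0, 1, row_sums)

# Illustrative check: 200 responses to "Sad" stimuli, split as in Table 2.1
demo = ([("Sad", "Sad")] * 196 + [("Sad", "Disgust")]
        + [("Sad", "Neutral")] * 3)
print(confusion_percentages(demo)[IDX["Sad"]])
# -> [ 0.   0.   0.5  0.  98.   1.5]
```

Because each row is normalized by its own response count, the diagonal entries of such a matrix can be read directly as per-emotion recognition accuracies.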
Table 2.1 shows the confusion matrix for human recognition performance on the NAW dataset. It can be seen that most judges were able to identify sad, angry, neutral, and happy quite easily, with at least 76% accuracy. This is followed by surprised with 64% accuracy and disgust with only 34% accuracy. Disgust yielded very low recognition, which suggests that the judges were unclear about its definition and may have perceived disgust as mild anger, resulting in a higher percentage of responses labeled as angry. Similarly, surprised