Table 10.18 Number of instances for the 2-class emotion problem

    #       NEG      IDL        Σ
    Train  3 358    6 601    9 959
    Test   2 465    5 792    8 257
    Σ      5 823   12 393   18 216
the data is labelled on the word level. If three or more labellers agreed, the label was
attributed to the word. All in all, there are 48 401 words.
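The agreement rule above can be sketched as follows; this is a minimal illustration of word-level label attribution, assuming each word carries the raw labels of the five labellers (function and variable names are hypothetical, not from the original corpus tooling):

```python
from collections import Counter

def word_label(annotations, min_agreement=3):
    """Assign a word-level label only if enough labellers agree.

    `annotations` holds the labels the five labellers gave to one
    word; the majority label is returned, or None when fewer than
    `min_agreement` labellers chose the same label.
    """
    label, count = Counter(annotations).most_common(1)[0]
    return label if count >= min_agreement else None

# Three of five labellers chose "emphatic", so the word is labelled:
print(word_label(["emphatic", "emphatic", "neutral", "emphatic", "touchy"]))
# → emphatic
```

Words on which no label reaches the agreement threshold remain without an emotion label under this rule.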
Classification experiments on a subset of the corpus [166] showed that the best unit
of analysis is neither the word nor the turn, but an intermediate chunk, which offers
the best compromise between the length of the unit of analysis and the homogeneity
of the different emotional / emotion-related states within one unit. Hence, manually
defined chunks based on syntactic-prosodic criteria [166] are used here (cf. also [167]).
whole corpus consisting of 18 216 chunks was used for the 2009 Emotion Challenge.
The two-class problem chosen as an example for this topic consists of the cover
classes NEGative (subsuming angry, touchy, reprimanding, and emphatic) and IDLe
(consisting of all non-negative states). A heuristic approach similar to the one
applied in [166] is used to map the labels of the five labellers on the word level
onto one label for the whole chunk. Since the whole corpus is used, the classes are
highly unbalanced. The frequencies for the two-class problem are given in Table 10.18.
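The exact mapping heuristic of [166] is not reproduced here; a simple majority-vote sketch conveys the idea of collapsing word-level labels onto one two-class chunk label (the set of negative states follows the cover-class definition above, while the voting scheme itself is an illustrative assumption):

```python
from collections import Counter

# States subsumed under the NEGative cover class; every other
# state falls into IDLe (illustrative majority vote, not the
# exact heuristic of [166]).
NEGATIVE_STATES = {"angry", "touchy", "reprimanding", "emphatic"}

def chunk_label(word_labels):
    """Map the word-level labels of one chunk onto NEG or IDL."""
    votes = Counter("NEG" if w in NEGATIVE_STATES else "IDL"
                    for w in word_labels)
    return votes.most_common(1)[0][0]

print(chunk_label(["neutral", "angry", "touchy", "angry"]))  # → NEG
```

Because IDLe absorbs all non-negative states, such a mapping over the full corpus naturally yields the class imbalance shown in Table 10.18.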
Speaker independence is guaranteed by using the data of one school (Ohm; 13 male,
13 female) for training and the data of the other school (Mont; 8 male, 17 female)
for testing. In the training set, the chunks are given in sequential order, and the
chunk ID encodes which child a chunk belongs to. In the test set, the chunks are
presented in random order without any information about the speaker. Additionally,
the transliteration of the spoken word chain of the training set and the vocabulary
of the whole corpus are provided, allowing for ASR training and linguistic feature
computation.
For the second task, which deals with short-term user states, namely the
determination of speaker interest, the TUM AVIC database (cf. Sect. 5.3.1)
was used in the follow-up challenge in 2010. It features 2 h of human conversational
speech recordings (21 subjects), annotated in five different levels of interest. The
corpus further features a uniquely detailed transcription of spoken content with word
boundaries by forced alignment, non-linguistic vocalisations, single annotator tracks,
and the sequence of (sub-)speaker-turns.
10.4.2.2 Methodology
In the past, the main focus was on prosodic features, in particular pitch, durations,
and intensity [168]. Comparably small feature sets (10-100 features) were first
utilised. Only a few studies pursued low-level feature modelling on the frame level,
usually by HMMs or GMMs. The higher success of static feature vectors derived by
projection of the LLDs such as pitch or energy by descriptive statistical functional
application
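The projection of a variable-length, frame-level LLD contour onto a fixed-length static vector via statistical functionals can be sketched as follows; the particular functionals chosen here (mean, standard deviation, extrema, range) are a minimal illustrative subset, not the full functional set used in the challenges:

```python
import statistics

def functionals(lld):
    """Project a frame-wise LLD contour (e.g. pitch or energy per
    frame) onto a fixed-length static feature vector by applying
    descriptive statistical functionals."""
    return {
        "mean": statistics.fmean(lld),
        "stdev": statistics.pstdev(lld),
        "min": min(lld),
        "max": max(lld),
        "range": max(lld) - min(lld),
    }

# A short (hypothetical) pitch contour in Hz yields one vector of
# fixed dimensionality, regardless of the contour's length:
pitch = [180.0, 195.5, 210.0, 170.0, 160.5]
vector = functionals(pitch)
```

Because the resulting vector has a fixed dimensionality for any input length, it can be fed directly to static classifiers such as SVMs, in contrast to frame-level HMM/GMM modelling.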