respective first order delta coefficients. Then, for each frame-wise LLD, the arithmetic mean and
standard deviation across the frame itself and eight of its neighbouring frames (four
before and four after) are appended. This results in 47 × 3 = 141 descriptors
per frame. This feature set has been used with great success on the ComParE
Vocalization Sub-Challenge data in previous work (Brueckner and Schuller 2013, 2014).
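As an illustrative sketch of this contextualisation step (not the implementation used in the cited work), the following Python snippet appends the windowed mean and standard deviation to each frame's LLD vector; the function name and the edge padding at clip boundaries are our own assumptions.

import numpy as np

def add_context_statistics(lld, context=4):
    """Append mean and std over a sliding window covering the current frame
    plus `context` frames before and after (here 4 + 1 + 4 = 9 frames).

    lld: array of shape (n_frames, n_descriptors), e.g. 47 frame-wise LLDs.
    Returns an array of shape (n_frames, 3 * n_descriptors).
    """
    n_frames, n_desc = lld.shape
    # Repeat the edge frames so every frame has a full 9-frame window
    # (an assumption; boundary handling is not specified in the text).
    padded = np.pad(lld, ((context, context), (0, 0)), mode="edge")
    # Stack the 2*context+1 shifted copies: shape (n_frames, window, n_desc).
    windows = np.stack(
        [padded[i:i + n_frames] for i in range(2 * context + 1)], axis=1
    )
    ctx_mean = windows.mean(axis=1)
    ctx_std = windows.std(axis=1)
    # 47 LLDs -> 47 * 3 = 141 descriptors per frame.
    return np.concatenate([lld, ctx_mean, ctx_std], axis=1)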
Finally, feature set III is a modification of the feature set proposed by Kim et al.
(2012) and consists of a conversational and a prosodic part: The first corresponds to
turn duration statistics, namely mean, median, maximum, variance and minimum of
speaker turn durations in each audio clip, as well as the number of turns. It further
includes total speaking time statistics, i.e. mean, median, maximum, variance and
minimum of the total speaking time for individual speakers in the clips as well
as the number of people speaking. We finally add the overlap ratio described in
Sect. 19.6.3 and conclude the conversational feature part with the turn keeping/turn
stealing ratio in the clip, defined as the ratio between the number of times a speaker
change happens and the number of times a speaker change does not happen after an
overlap. This conversational part is complemented by prosodic features including
clip-based statistics: mean, median, standard deviation, maximum, minimum and
quantiles (0.01, 0.25, 0.75 and 0.99) of pitch and intensity statistics obtained from
the entire clip, with the pitch and intensity LLDs being identical to the ones in
feature set I. These general prosodic features are complemented by speaker turn-
based statistics, i.e. mean, median and standard deviation of pitch and intensity
obtained over individual speaker turns (similarly to the clip-based statistics). Note
that the statistics above are estimated not only on single-talker segments, but also
over overlapping speech segments. Altogether, feature set III contains 38 features.
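A minimal sketch of how the conversational statistics of feature set III could be computed is given below. The input representation (per-turn speaker labels with start and end times, and speaker pairs around overlaps) and all function names are assumptions for illustration, not the authors' implementation.

import numpy as np

def conversational_features(turns):
    """Conversational statistics for one audio clip.

    turns: list of (speaker_id, start, end) tuples in seconds
    (an assumed input representation).
    """
    durations = np.array([end - start for _, start, end in turns])
    feats = {
        # Speaker turn duration statistics plus the number of turns.
        "turn_dur_mean": durations.mean(),
        "turn_dur_median": np.median(durations),
        "turn_dur_max": durations.max(),
        "turn_dur_var": durations.var(),
        "turn_dur_min": durations.min(),
        "n_turns": len(turns),
    }
    # Total speaking time per speaker, then statistics across speakers.
    per_speaker = {}
    for spk, start, end in turns:
        per_speaker[spk] = per_speaker.get(spk, 0.0) + (end - start)
    spk_times = np.array(list(per_speaker.values()))
    feats.update({
        "spk_time_mean": spk_times.mean(),
        "spk_time_median": np.median(spk_times),
        "spk_time_max": spk_times.max(),
        "spk_time_var": spk_times.var(),
        "spk_time_min": spk_times.min(),
        "n_speakers": len(per_speaker),
    })
    return feats

def turn_keeping_stealing_ratio(overlaps):
    """overlaps: list of (speaker_before, speaker_after) pairs observed around
    each overlap segment (again a hypothetical representation).
    Returns (#speaker changes after an overlap) / (#times the floor is kept)."""
    changes = sum(1 for before, after in overlaps if before != after)
    keeps = len(overlaps) - changes
    return changes / keeps if keeps else float("inf")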
19.6 Experiments and Results

19.6.1 Challenge Baseline Comparison
For the ComParE Conflict Sub-Challenge baseline, results were supplied by the
challenge organizers (Schuller et al. 2013) for both a classification task, where each
utterance is classified as being either non-conflictual ( low ) or conflictual ( high ), and
a regression task, aiming to predict the rater's score value in the range [−10, +10].
For the classification task, the primary evaluation measure was chosen to be the
unweighted average recall (UAR), which has been used since the INTERSPEECH 2009
Emotion Challenge (Schuller et al. 2011). The motivation to use the unweighted
rather than the weighted average recall (“conventional” accuracy) is that it is also
meaningful for highly unbalanced distributions of instances among classes.
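The following sketch illustrates this point: a classifier that always predicts the majority class reaches high accuracy (weighted average recall) on unbalanced data but only chance-level UAR. The function name and the toy data are illustrative assumptions.

import numpy as np

def unweighted_average_recall(y_true, y_pred, labels=("low", "high")):
    """UAR: per-class recalls averaged with equal weight per class, so a
    majority-class-only classifier scores only 1/#classes regardless of
    how unbalanced the class distribution is."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = []
    for label in labels:
        mask = y_true == label
        if mask.any():
            recalls.append((y_pred[mask] == label).mean())
    return float(np.mean(recalls))

# Toy example: 90 "low" and 10 "high" instances; always predicting "low"
# yields 90 % accuracy but only 50 % UAR (chance level for two classes).
y_true = ["low"] * 90 + ["high"] * 10
y_pred = ["low"] * 100
print(unweighted_average_recall(y_true, y_pred))  # 0.5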