with the baseline feature set it further improves upon the results. Using BLSTM
RNNs based on frame-wise features, we predict the overlap ratio and show that this
prediction outperforms the reference feature set, and even more so when combined with it.
This confirms the findings in Grèzes et al. (2013).
Encouraged by these results, we add the predicted overlap ratio to a carefully
constructed feature set, which combines conversational and prosodic features,
and train DNN models on it. Our best models outperform the Conflict Sub-
Challenge baseline and the best challenge contributions on both the classification
task, predicting if an utterance is conflict or non-conflict, and the regression task,
predicting the level of conflict in the range [−10.0, +10.0].
Adopting a DNN architecture with two hidden layers of rectified linear units, pre-
trained and fine-tuned on the conversational-prosodic feature set using the dropout
technique, we outperform all previously reported results. On the classification task
we achieve a UAR = 84.3 %, which improves the baseline result by 3.5 % and the
best result reported for the Conflict Sub-Challenge by Räsänen and Pohjalainen
(2013) by 0.4 %. On the regression task the relative improvements are smaller,
but they still raise the Challenge benchmark for the correlation coefficient from 82.6 % to 83.8 %,
measured on the test set. It is interesting to note that while for the baseline the best
cross-correlation percentage is higher than the best UAR percentage, this is different
in our study. It should be noted, however, that the best results of the baseline were
obtained for different SVM complexity parameters C, as shown in Table 19.4.
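For readers unfamiliar with the two evaluation measures, the following minimal sketch shows how the unweighted average recall (UAR) and the correlation coefficient are commonly computed. It is an illustration only, not the evaluation code used in this study: the function names and the toy labels are assumptions made for the example, and the correlation is taken to be the standard Pearson coefficient.

```python
from collections import defaultdict
from math import sqrt


def unweighted_average_recall(y_true, y_pred):
    """Mean of the per-class recalls; each class counts equally, whatever its size."""
    hits = defaultdict(int)    # correctly predicted instances per true class
    totals = defaultdict(int)  # number of instances per true class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Toy data (hypothetical): two "conflict" and four "non-conflict" utterances.
y_true = ["conflict", "conflict", "non-conflict",
          "non-conflict", "non-conflict", "non-conflict"]
y_pred = ["conflict", "non-conflict", "non-conflict",
          "non-conflict", "non-conflict", "non-conflict"]

# Recall is 0.5 for "conflict" and 1.0 for "non-conflict", so the UAR is 0.75,
# although 5 of the 6 utterances are classified correctly.
uar = unweighted_average_recall(y_true, y_pred)
```

Because the UAR averages over classes rather than over instances, a classifier cannot score well by favouring the majority class, which is why it is preferred over plain accuracy on unbalanced data.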
These results are very promising; however, they are partly based on the manual
speaker-turn annotations provided with the database. To build a fully automatic system,
we intend to continue our study of conflict detection by deploying an automatic
speaker-turn detection and diarization system.
Furthermore, the feature set that led to the best results reported in this study
was hand-tuned; although it shows high potential, the selection of features might
still be sub-optimal. We therefore intend to complement our current approach with
a feature selection algorithm, e.g. as suggested in Räsänen and
Pohjalainen (2013). This way we hope to further improve upon the current results.
Acknowledgements The research presented in this publication was conducted while the first
author was employed by Nuance Communications Deutschland GmbH.
References
Bishop CM (2006) Pattern recognition and machine learning. Springer, Berlin
Boakye K, Vinyals O, Friedland G (2011) Improved overlapped speech handling for speaker
diarization. In: Proceedings of interspeech, ISCA, Florence, Aug 2011, pp 941-944
Bousmalis K, Mehu M, Pantic M (2009) Spotting agreement and disagreement: a survey of
nonverbal audiovisual cues and tools. In: Proceedings of the 3rd international conference
on affective computing and intelligent interaction and workshops, ACII 2009, vol 2. IEEE
Computer Society Press, Los Alamitos
Brueckner R, Schuller B (2012) Likability classification - a not so deep neural network approach.
In: Proceedings of interspeech, Portland, OR, Sep 2012