Be at Odds? Deep and Hierarchical Neural Networks for Classification and Regression of Conflict in Speech - Conflict and Multimodal Communication: Social Research and Machine Intelligence


with the baseline feature set it further improves upon the results. Using BLSTM RNNs based on frame-wise features, we predict the overlap ratio and show that this prediction outperforms the reference feature set, even more so when combined with it. This confirms the findings of Grèzes et al. (2013).
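
As a rough illustration of what an overlap ratio derived from frame-wise predictions could look like, the sketch below assumes that a network emits one overlap probability per frame and that the ratio is the fraction of frames classified as overlapped speech. The function name, the threshold of 0.5 and the toy probabilities are hypothetical; this section does not state how the BLSTM output is actually turned into the ratio.

```python
def overlap_ratio(frame_overlap_probs, threshold=0.5):
    """Fraction of frames whose predicted overlap probability reaches the threshold.

    Assumption: one probability per frame; the threshold is illustrative only.
    """
    if not frame_overlap_probs:
        return 0.0
    overlapped = sum(1 for p in frame_overlap_probs if p >= threshold)
    return overlapped / len(frame_overlap_probs)


# Hypothetical frame-wise outputs for a ten-frame clip.
probs = [0.1, 0.2, 0.7, 0.9, 0.8, 0.3, 0.1, 0.6, 0.4, 0.05]
ratio = overlap_ratio(probs)
print(ratio)  # 0.4 (four of ten frames are at or above the threshold)
```

The resulting scalar can then be appended to an utterance-level feature vector like any other feature.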

Encouraged by these results, we add the predicted overlap ratio to a carefully constructed feature set, which combines conversational and prosodic features, and train DNN models on it. Our best models outperform the Conflict Sub-Challenge baseline and the best challenge contributions on both the classification task, predicting whether an utterance is conflict or non-conflict, and the regression task, predicting the level of conflict in the range [-10.0, +10.0].
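
The best configuration reported in this chapter is a DNN with two hidden layers of rectified linear units trained with dropout. The following is a minimal sketch of the forward pass of such a network, not the authors' implementation: the layer sizes, the weights, the dropout rate and all function names are assumptions chosen for illustration, and pre-training and fine-tuning are left out.

```python
import random


def relu(v):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, x) for x in v]


def dense(v, weights, bias):
    """Affine layer; `weights` holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]


def dropout(v, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale survivors."""
    if rate <= 0.0:
        return list(v)
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in v]


def forward(x, layers, rate=0.0, rng=None):
    """Hidden layers use ReLU followed by dropout; the output layer is linear."""
    if rng is None:
        rng = random.Random(0)
    h = x
    for weights, bias in layers[:-1]:
        h = dropout(relu(dense(h, weights, bias)), rate, rng)
    weights, bias = layers[-1]
    return dense(h, weights, bias)


# Toy network: 2 inputs, two hidden layers of 2 units each, 1 linear output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 1.0], [-1.0, 1.0]], [0.0, 0.5]),
    ([[2.0, -1.0]], [0.1]),
]
x = [1.0, 2.0]
output = forward(x, layers)  # rate=0.0: dropout disabled, as at test time
print(output)
```

During training, `rate` would be set above zero so that hidden units are randomly dropped; at test time it stays at zero, and the rescaling inside `dropout` keeps the expected activations comparable between the two modes.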

Adopting a DNN architecture with two hidden layers of rectified linear units, pre-trained and fine-tuned on the conversational-prosodic feature set using the dropout technique, we outperform all previously reported results. On the classification task we achieve UAR = 84.3 %, which improves the baseline result by 3.5 % and the best result reported for the Conflict Sub-Challenge by Räsänen and Pohjalainen (2013) by 0.4 %. On the regression task the relative improvements are smaller, but they still raise the Challenge benchmark for the correlation coefficient from 82.6 % to 83.8 %, measured on the test set. It is interesting to note that while for the baseline the best cross-correlation percentage is higher than the best UAR percentage, the reverse holds in our study. It should be noted, however, that the best results of the baseline were obtained for different SVM complexity parameters C, as shown in Table 19.4.
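
For readers unfamiliar with the two evaluation measures, the sketch below shows how they can be computed. UAR (unweighted average recall) is the mean of the per-class recalls, so each class counts equally regardless of its size. For the regression measure the sketch assumes Pearson's correlation coefficient. The labels and scores are made-up toy values, not data from the study.

```python
import math


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; every class in y_true has equal weight."""
    recalls = []
    for c in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


def pearson_correlation(x, y):
    """Pearson's correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Toy classification example with an imbalanced class distribution.
y_true = ["conflict"] * 2 + ["non-conflict"] * 6
y_pred = ["conflict", "non-conflict"] + ["non-conflict"] * 6
uar = unweighted_average_recall(y_true, y_pred)
print(uar)  # 0.75: recalls of 0.5 and 1.0, although plain accuracy is 0.875

# Toy regression example on the [-10, +10] conflict scale.
gold = [-10.0, 0.0, 10.0]
pred = [-5.0, 0.0, 5.0]
cc = pearson_correlation(gold, pred)
print(round(cc, 6))  # 1.0: perfectly linear relation despite the different scale
```

The toy classification case shows why UAR is preferred over accuracy for imbalanced classes: missing half of the minority class lowers UAR to 75 % while accuracy stays at 87.5 %.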

These results are very promising; however, they are partly based on the manual speaker-turn annotations provided with the database. To arrive at a fully automatic system, we intend to continue our study of conflict detection by deploying an automatic speaker turn detection and diarization system.

Furthermore, the feature set that led to the best results reported in this study was hand-tuned; while it shows high potential, the selection of features might still be sub-optimal. We therefore intend to complement our current approach with a feature selection algorithm, e.g. as suggested in Räsänen and Pohjalainen (2013). This way we hope to further improve upon the current results.

Acknowledgements The research presented in this publication was conducted while the first author was employed by Nuance Communications Deutschland GmbH.

References

Bishop CM (2006) Pattern recognition and machine learning. Springer, Berlin

Boakye K, Vinyals O, Friedland G (2011) Improved overlapped speech handling for speaker diarization. In: Proceedings of Interspeech, ISCA, Florence, Aug 2011, pp 941-944

Bousmalis K, Mehu M, Pantic M (2009) Spotting agreement and disagreement: a survey of nonverbal audiovisual cues and tools. In: Proceedings of the 3rd international conference on affective computing and intelligent interaction and workshops, ACII 2009, vol 2. IEEE Computer Society Press, Los Alamitos

Brueckner R, Schuller B (2012) Likability classification - a not so deep neural network approach. In: Proceedings of Interspeech, Portland, OR, Sep 2012
