Be at Odds? Deep and Hierarchical Neural Networks for Classification and Regression of Conflict in Speech - Conflict and Multimodal Communication: Social Research and Machine Intelligence

overall diarization error rate by improving the overlap detection rate. Later they improved their results by combining Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) with their baseline HMM system (Geiger et al. 2013). We draw inspiration from this idea, but employ neural networks exclusively to estimate overlap from acoustic features.

Neural networks have previously been used successfully in the field of computational paralinguistics. Stuhlsatz et al. (2011) introduced generalized discriminant analysis using deep neural networks (DNNs) for the task of acoustic emotion recognition and achieved highly significant improvements over previously reported baselines on a number of frequently used emotional speech corpora. Brueckner and Schuller (2012) showed good performance on the Interspeech 2012 Speaker Trait Likability Sub-Challenge (Schuller et al. 2012) adopting a moderately deep DNN. Using a hierarchical DNN, the same authors recently obtained the best results reported in the literature on the ComParE Social Signals Sub-Challenge, surpassing the baseline results by 9.1 % (Brueckner and Schuller 2013). These results have since been surpassed by deep bi-directional LSTM RNNs (Brueckner and Schuller 2014).

Encouraged by this success, we investigate and demonstrate how deep and hierarchical neural networks, which have become the new mainstream paradigm in automatic speech recognition over the last few years, can be leveraged to automatically classify and predict levels of conflict based purely on audio recordings. To this end, we use a hierarchical DNN to predict the conflict level on the Conflict Sub-Challenge of the Interspeech 2013 Computational Paralinguistics Challenge (Schuller et al. 2013). We further use a bi-directional Long Short-Term Memory (BLSTM) RNN to predict overlapping speech segments and demonstrate that a DNN fed with this predicted overlap achieves state-of-the-art performance. Finally, we show that by integrating this predicted overlap into a conversational-prosodic feature set we can improve the results even further, for both classification and regression. Using this combined feature set, we obtain the best results reported so far in the literature on this data set for both tasks.
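
As a rough illustration of what integrating predicted overlap into an utterance-level feature set can look like, the sketch below summarises frame-wise overlap probabilities into a few clip-level statistics and appends them to an existing feature vector. The function names, the 0.5 threshold, and the three statistics are assumptions made for this example; they are not the feature definitions used in this chapter, and the frame probabilities are taken as given rather than produced by a BLSTM.

```python
from statistics import mean


def overlap_statistics(frame_probs, threshold=0.5):
    """Summarise frame-wise overlap probabilities into clip-level features.

    frame_probs: per-frame overlap probabilities in [0, 1], assumed to come
    from some frame-level overlap detector (hypothetical input here).
    threshold: illustrative decision threshold, not taken from the chapter.
    """
    if not frame_probs:
        raise ValueError("frame_probs must not be empty")
    flags = [p >= threshold for p in frame_probs]
    # A segment starts at every overlapping frame whose predecessor
    # is not overlapping (or that is the first frame).
    segments = sum(
        1 for i, f in enumerate(flags) if f and (i == 0 or not flags[i - 1])
    )
    return {
        "overlap_mean_prob": mean(frame_probs),
        "overlap_ratio": sum(flags) / len(flags),
        "overlap_segments": segments,
    }


def combine_features(prosodic, frame_probs):
    """Append overlap statistics (in sorted key order) to a feature vector."""
    stats = overlap_statistics(frame_probs)
    return list(prosodic) + [stats[key] for key in sorted(stats)]


if __name__ == "__main__":
    probs = [0.1, 0.9, 0.8, 0.2, 0.7, 0.1]
    print(combine_features([1.0, 2.0], probs))
```

The combined vector could then serve as input to any classifier or regressor; fixing the key order keeps the feature layout stable across clips.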

In Sect. 19.2 we describe how to pre-train and build a DNN using Restricted Boltzmann Machines (RBMs) and how to handle real-valued input using Gaussian-Bernoulli RBMs. We further outline two recent advances to DNNs, rectified linear units and dropout. In Sect. 19.3 we briefly describe RNNs, in particular LSTM models and their bidirectional variant, and show how we will use them to generate predictions of overlapping speech segments. The underlying database used in this study is described in Sect. 19.4 and the derived feature sets are discussed in Sect. 19.5. We describe and discuss our experiments and results in Sect. 19.6 and present our conclusions in Sect. 19.7.
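
For readers unfamiliar with the two DNN advances named above, the following minimal sketch shows a rectified linear unit and dropout applied to a list of activations. It assumes the "inverted" dropout variant, which rescales the surviving units during training so that nothing needs rescaling at test time; the chapter may use a different formulation, and the function names and default rate are illustrative only.

```python
import random


def relu(xs):
    """Rectified linear unit: max(0, x) applied element-wise."""
    return [max(0.0, x) for x in xs]


def dropout(xs, p=0.5, training=True, rng=None):
    """Inverted dropout (an assumed variant, see the note above).

    During training each unit is zeroed with probability p and the
    survivors are scaled by 1 / (1 - p); at test time the input is
    returned unchanged.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(xs)
    rng = rng or random.Random()
    keep = 1.0 - p
    return [x / keep if rng.random() < keep else 0.0 for x in xs]


if __name__ == "__main__":
    hidden = relu([-1.0, 0.0, 2.5])
    print(hidden)
    print(dropout(hidden, p=0.5, rng=random.Random(0)))
    print(dropout(hidden, training=False))
```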
