overall diarization error rate by improving the overlap detection rate. Later they
improved their results by combining Long Short-Term Memory (LSTM) recurrent
neural networks (RNNs) with their baseline HMM system (Geiger et al. 2013). We
will draw inspiration from this idea by exclusively employing neural networks to
estimate overlap from acoustic features.
Neural networks have previously been utilized successfully in the field of
computational paralinguistics. Stuhlsatz et al. (2011) introduced generalized discriminant
analysis using deep neural networks (DNNs) for the task of acoustic emotion
recognition and achieved highly significant improvements over previously reported
baselines on a number of frequently used emotional speech corpora. Brueckner and
Schuller (2012) showed good performance on the Interspeech 2012 Speaker Trait
Likability Sub-Challenge (Schuller et al. 2012) using a moderately deep DNN. Using
a hierarchical DNN, the same authors subsequently obtained the best results reported
in the literature on the ComParE Social Signals Sub-Challenge, outperforming the
baseline results by 9.1 % (Brueckner and Schuller 2013). These results have since
been surpassed by adopting deep bi-directional LSTM RNNs (Brueckner and
Schuller 2014).
Encouraged by this success we investigate and demonstrate how deep and
hierarchical neural networks, which have become the new mainstream paradigm
in automatic speech recognition over the last few years, can be leveraged to
automatically classify and predict levels of conflict purely based on audio recordings.
To this end, we use a hierarchical DNN to predict the conflict level on
the Conflict Sub-Challenge of the Interspeech 2013 Computational Paralinguistics
Challenge (Schuller et al. 2013). We further use a bi-directional Long Short-Term
Memory (BLSTM) RNN to predict overlapping speech segments and demonstrate
that a DNN fed with this predicted overlap achieves state-of-the-art performance.
Ultimately, we show that by integrating this predicted overlap into a conversational-
prosodic feature set we can improve the results even further, both for classification
and regression. Using this combined feature set we obtain the best results reported
so far in the literature on this data set for both the classification and the regression
task.
In Sect. 19.2 we describe how to pre-train and build a DNN using Restricted
Boltzmann Machines (RBMs) and how to handle real-valued input using
Gaussian-Bernoulli RBMs. We further outline two recent advances to DNNs, rectified linear
units and dropout. In Sect. 19.3 we briefly describe RNNs, in particular LSTM
models and their bidirectional variant, and show how we will use them to generate
predictions of overlapping speech segments. The underlying database used in
this study is described in Sect. 19.4 and the derived feature sets are discussed in
Sect. 19.5. We describe and discuss our experiments and results in Sect. 19.6 and
present our conclusions in Sect. 19.7.