19.3.3 BLSTM as Overlap Prediction Generators
In the introduction we outlined that overlap is a very informative feature for
conflict level prediction, and we will confirm this observation in Sect. 19.6.3. In
real-world applications, manual speech overlap annotations are not available and
thus must be reliably estimated from the speech signal itself. We extend an approach
presented in Geiger et al. (2013) by using a BLSTM model as a non-linear classifier
to generate frame-wise overlap predictions. To this end, we feed the sequence of
input feature vectors
\[
X = [x(1), \ldots, x(T)] \tag{19.21}
\]
into the network, where $T$ is the total number of frames in the audio sequence, and
obtain an output $y(t)$ at the sigmoid output layer for each time step $t$. Because
the network is bidirectional, the output $y(t)$ depends on both the past input up to
time $t$ and the future input back to time $t$:
\[
y(t) = g_f\bigl(x(1), \ldots, x(t)\bigr) + g_b\bigl(x(T), \ldots, x(t)\bigr), \tag{19.22}
\]
where $g_f$ and $g_b$ denote the functions computed by the forward and backward parts
of the BLSTM, respectively.
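The dependence on both past and future input can be illustrated with a minimal NumPy sketch. It is not the network used in this study: plain tanh recurrences stand in for the LSTM cells, the weights are random and untrained, and the names `run_rnn`, `bidirectional_predict`, and `init_params` are hypothetical. It only shows how a forward pass and a time-reversed backward pass are combined at a sigmoid output layer.

```python
import numpy as np

def run_rnn(X, W_in, W_rec):
    """Plain tanh recurrence (a simplified stand-in for LSTM cells).

    X has shape (T, n_features); returns one hidden state per frame, shape (T, n_hidden).
    """
    T = X.shape[0]
    n_hidden = W_rec.shape[0]
    H = np.zeros((T, n_hidden))
    h = np.zeros(n_hidden)
    for t in range(T):
        h = np.tanh(W_in @ X[t] + W_rec @ h)
        H[t] = h
    return H

def init_params(n_features, n_hidden, seed=0):
    """Random, untrained weights (assumption: for illustration only)."""
    rng = np.random.default_rng(seed)
    scale = 0.3
    return {
        "W_in_f": scale * rng.normal(size=(n_hidden, n_features)),
        "W_rec_f": scale * rng.normal(size=(n_hidden, n_hidden)),
        "W_in_b": scale * rng.normal(size=(n_hidden, n_features)),
        "W_rec_b": scale * rng.normal(size=(n_hidden, n_hidden)),
        "w_out_f": scale * rng.normal(size=n_hidden),
        "w_out_b": scale * rng.normal(size=n_hidden),
        "b": 0.0,
    }

def bidirectional_predict(X, params):
    """Frame-wise outputs y(t) in (0, 1) that depend on x(1..t) and x(T..t)."""
    # Forward part: processes frames in temporal order.
    H_f = run_rnn(X, params["W_in_f"], params["W_rec_f"])
    # Backward part: processes the reversed sequence, then is flipped back to temporal order.
    H_b = run_rnn(X[::-1], params["W_in_b"], params["W_rec_b"])[::-1]
    # Both contributions are summed before the sigmoid output layer.
    z = H_f @ params["w_out_f"] + H_b @ params["w_out_b"] + params["b"]
    return 1.0 / (1.0 + np.exp(-z))
```

Calling `bidirectional_predict(X, init_params(X.shape[1], 8))` on a feature matrix `X` of shape `(T, n_features)` returns one value per frame; changing a later frame alters earlier outputs only through the backward part.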
For training the network, the targets are defined as
\[
y(t) =
\begin{cases}
1, & \text{if } x(t) \in \text{overlap} \\
0, & \text{else}
\end{cases}
\tag{19.23}
\]
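As an illustration of the target definition, the following sketch derives frame-wise binary targets from overlap annotations. The annotation format (a list of `(start, end)` times in seconds), the default frame shift of 10 ms, and the function name `frame_targets` are assumptions made for this example and are not specified in the text.

```python
import numpy as np

def frame_targets(overlap_segments, n_frames, frame_shift=0.01):
    """Return a 0/1 target per frame: 1 if the frame start lies in an overlap segment.

    overlap_segments: iterable of (start, end) times in seconds (assumed format).
    frame_shift: time between consecutive frames in seconds (assumed value).
    """
    targets = np.zeros(n_frames, dtype=np.int8)
    frame_times = np.arange(n_frames) * frame_shift
    for start, end in overlap_segments:
        # Mark every frame whose start time falls inside [start, end).
        targets[(frame_times >= start) & (frame_times < end)] = 1
    return targets

# Example: 8 frames with a 0.25 s shift and one overlap segment from 0.5 s to 1.25 s.
example_targets = frame_targets([(0.5, 1.25)], n_frames=8, frame_shift=0.25)
```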
As in Geiger et al. (2013), the predictions $y(t)$ of the trained network are used for
classification by applying a threshold $\theta$ as follows:
\[
c(t) =
\begin{cases}
1, & \text{if } y(t) \geq \theta \\
0, & \text{if } y(t) < \theta
\end{cases}
\tag{19.24}
\]
The threshold can be varied in order to select a specific operating point with a
different trade-off between precision and recall.
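The effect of the threshold on the operating point can be sketched as follows. The helper names `classify` and `precision_recall` and the toy prediction values are hypothetical; the thresholding itself follows the decision rule of Eq. (19.24), with frames at or above the threshold labeled as overlap.

```python
import numpy as np

def classify(y, theta):
    """Binary frame decisions: 1 where the prediction reaches the threshold, else 0."""
    return (np.asarray(y) >= theta).astype(int)

def precision_recall(c, targets):
    """Frame-level precision and recall of decisions c against 0/1 targets."""
    c = np.asarray(c)
    targets = np.asarray(targets)
    tp = int(np.sum((c == 1) & (targets == 1)))
    fp = int(np.sum((c == 1) & (targets == 0)))
    fn = int(np.sum((c == 0) & (targets == 1)))
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall

# Toy predictions and targets (made-up values for illustration only).
y_toy = [0.1, 0.4, 0.6, 0.9, 0.3, 0.7]
targets_toy = [0, 0, 1, 1, 1, 0]
for theta in (0.2, 0.5, 0.8):
    p, r = precision_recall(classify(y_toy, theta), targets_toy)
    print(f"theta={theta:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, raising the threshold from 0.2 to 0.8 increases precision and lowers recall, which is the trade-off described above.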
19.4 Database
The experiments and results presented in this study are based on the SSPNet Conflict
Corpus (SC$^2$) (Kim et al. 2012), which was also used in the Conflict Sub-Challenge
of the Interspeech 2013 Computational Paralinguistics Challenge (Schuller et al.
2013). It contains 1,430 clips, each 30 s long, extracted from the Canal9 Corpus
(Vinciarelli et al. 2009), a publicly available corpus of broadcasted Swiss