19.3.3 BLSTM as Overlap Prediction Generators
In the introduction we outlined that overlap is a very informative feature for conflict level prediction, and we will confirm this observation in Sect. 19.6.3. In real-world applications, manual speech overlap annotations are not available, so overlap must be reliably estimated from the speech signal itself. We extend an approach presented in Geiger et al. (2013) by using a BLSTM model as a non-linear classifier to generate frame-wise overlap predictions. To this end, we feed the input feature vector

\[ X = [x(1), \ldots, x(T)] \tag{19.21} \]
into the network, where $T$ is the total number of frames in the audio sequence, and obtain an output $y(t)$ at the sigmoid output layer for each time step $t$. Because the network is bidirectional, the output $y(t)$ depends on both past and future input, up to time $t$:

\[ y(t) = g_f\big(x(1), \ldots, x(t)\big) + g_b\big(x(T), \ldots, x(t)\big), \tag{19.22} \]
where $g_f$ and $g_b$ denote the functions computed by the forward and backward parts of the BLSTM, respectively.
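The bidirectional combination in Eq. (19.22) can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the text: each direction is a plain tanh recurrence with a scalar state standing in for an LSTM layer, the weights are arbitrary hypothetical values, and the sigmoid is applied to the sum of the two directions. The function names `run_direction` and `bidirectional_outputs` are illustrative only.

```python
import math


def run_direction(frames, w_in, w_rec):
    """Run a simple tanh recurrence over the frames, one scalar state per frame.

    Stand-in for one direction of the BLSTM; this is NOT an LSTM cell.
    """
    state = 0.0
    states = []
    for x in frames:
        # The state is driven by the current frame and the previous state.
        drive = sum(w * xi for w, xi in zip(w_in, x)) + w_rec * state
        state = math.tanh(drive)
        states.append(state)
    return states


def bidirectional_outputs(frames, fwd_params, bwd_params):
    """Combine forward and backward passes into one output per frame."""
    # Forward direction sees x(1), ..., x(t).
    g_f = run_direction(frames, *fwd_params)
    # Backward direction sees x(T), ..., x(t); reverse again to restore time order.
    g_b = run_direction(frames[::-1], *bwd_params)[::-1]
    # Assumed output layer: sigmoid of the summed contributions.
    return [1.0 / (1.0 + math.exp(-(f + b))) for f, b in zip(g_f, g_b)]


# Hypothetical toy input: T = 4 frames with 2 features each.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
fwd_params = ([0.5, -0.3], 0.8)  # (input weights, recurrent weight), arbitrary
bwd_params = ([0.2, 0.4], 0.6)
y = bidirectional_outputs(frames, fwd_params, bwd_params)
```

Changing only the last frame alters the output at the first frame, which is the dependence on future input that the backward pass provides.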
For training the network, the targets are defined as

\[ y(t) = \begin{cases} 1, & \text{if } x(t) \in \text{overlap} \\ 0, & \text{else} \end{cases} \tag{19.23} \]
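As a rough sketch of how the targets in Eq. (19.23) could be derived from manual annotations, the snippet below marks every frame whose start time falls inside an annotated overlap segment. The annotation format (a list of start and end times in seconds), the frame shift, and the function name `frame_targets` are assumptions made for illustration; the text does not specify them.

```python
def frame_targets(overlap_segments, num_frames, frame_shift):
    """Return one 0/1 target per frame.

    overlap_segments: list of (start, end) times in seconds (assumed format).
    frame_shift: assumed step between frames in seconds; frame t starts at
    t * frame_shift.
    """
    targets = [0] * num_frames
    for start, end in overlap_segments:
        for t in range(num_frames):
            frame_time = t * frame_shift
            # A frame counts as overlap if its start lies inside the segment.
            if start <= frame_time < end:
                targets[t] = 1
    return targets


# Hypothetical example: one overlap segment from 1.0 s to 2.0 s,
# six frames spaced 0.5 s apart.
targets = frame_targets([(1.0, 2.0)], num_frames=6, frame_shift=0.5)
# targets == [0, 0, 1, 1, 0, 0]
```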
As in Geiger et al. (2013), the predictions $y(t)$ of the trained network are used for classification by adopting the threshold $\theta$ as follows:

\[ c(t) = \begin{cases} 1, & \text{if } y(t) \geq \theta \\ 0, & \text{if } y(t) < \theta \end{cases} \tag{19.24} \]
The threshold $\theta$ can be varied in order to select a specific operating point with a different trade-off between precision and recall.
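The effect of the threshold on the operating point can be sketched as follows. The prediction and target values are made up for illustration, and the helper names `classify` and `precision_recall` are hypothetical; only the decision rule itself follows Eq. (19.24).

```python
def classify(predictions, theta):
    """Apply the decision rule of Eq. (19.24): 1 if y(t) >= theta, else 0."""
    return [1 if y >= theta else 0 for y in predictions]


def precision_recall(decisions, targets):
    """Compute frame-wise precision and recall for the overlap class."""
    tp = sum(1 for c, y in zip(decisions, targets) if c == 1 and y == 1)
    fp = sum(1 for c, y in zip(decisions, targets) if c == 1 and y == 0)
    fn = sum(1 for c, y in zip(decisions, targets) if c == 0 and y == 1)
    # Report 0.0 when a ratio is undefined (no positive decisions or targets).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up network outputs and reference targets for four frames.
predictions = [0.9, 0.6, 0.4, 0.2]
targets = [1, 0, 1, 0]

# Sweep a few thresholds to see how the operating point moves.
operating_points = {
    theta: precision_recall(classify(predictions, theta), targets)
    for theta in (0.3, 0.5, 0.8)
}
```

On this toy data, the lowest threshold recovers every overlap frame at the cost of a false alarm, while the highest threshold produces no false alarms but misses one overlap frame.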
19.4 Database
The experiments and results presented in this study are based on the SSPNet Conflict Corpus (SC²) (Kim et al. 2012), which was also used in the Conflict Sub-Challenge of the Interspeech 2013 Computational Paralinguistics Challenge (Schuller et al. 2013). It contains 1,430 clips, each 30 s long, extracted from the Canal9 Corpus (Vinciarelli et al. 2009), a publicly available corpus of broadcasted Swiss