Table 19.6 Results for the oracle overlap ratio as a single feature (top) and alongside baseline feature set I (bottom), varying the number of hidden layers on the classification (class) and regression task (score)

[%]                               Class            Score
                                  devel    test    devel    test
MLP (1-32-1)                      75.5     76.7    78.6     79.1
DNN (1-32-32-1)                   76.4     77.2    79.2     79.6
DNN (1-32-32-32-1)                75.4     76.7    78.9     79.2
MLP (6374-2048-1)                 80.6     81.8    81.3     81.8
DNN (6374-1024-1024-1)            81.4     82.5    81.9     82.5
DNN (6374-1024-1024-1024-1)       80.5     81.9    81.3     81.7

Shown are the best results obtained on the development set (devel) and on the test set (test). The percentages reported denote UAR for the classification task and CC for the regression task.
In order to see whether the overlap ratio adds information beyond the baseline features, we also trained a series of networks with the overlap ratio appended to feature set I, varying the number of hidden units from 64 to 4,096.
Table 19.6 shows the results for the respective optimal number of hidden units, H_hid = 32 for the oracle overlap ratio as a single feature and H_hid = 1,024 for the oracle overlap ratio alongside feature set I, trained with different numbers of hidden layers.
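As an illustration of this setup, the following PyTorch sketch builds the best-performing topology from the lower half of Table 19.6, DNN (6374-1024-1024-1), where the 6,374 inputs correspond to feature set I plus the appended overlap ratio. This is a hypothetical reconstruction rather than the authors' implementation; the use of ReLU hidden units and a sigmoid or linear output for the classification and regression tasks is assumed.

```python
# Hypothetical sketch (PyTorch) of the DNN (6374-1024-1024-1) topology from
# Table 19.6: feature set I plus the oracle overlap ratio as input, two hidden
# layers of 1,024 units each, and a single output unit.
import torch.nn as nn

def build_dnn(n_inputs=6374, n_hidden=1024, n_layers=2, classification=True):
    """Feed-forward network; ReLU hidden units are assumed here."""
    layers, width = [], n_inputs
    for _ in range(n_layers):
        layers += [nn.Linear(width, n_hidden), nn.ReLU()]
        width = n_hidden
    layers.append(nn.Linear(width, 1))
    # Sigmoid output for the binary conflict classification task,
    # linear output for regressing the conflict score.
    if classification:
        layers.append(nn.Sigmoid())
    return nn.Sequential(*layers)

# Classification (UAR) and regression (CC) variants share the topology.
clf = build_dnn(classification=True)
reg = build_dnn(classification=False)
```

The same builder covers the smaller topologies from the upper half of the table (overlap ratio as a single feature) by setting n_inputs=1 and n_hidden=32.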
Even when used alone, the oracle overlap ratio predicts the conflict class and the level of conflict with high accuracy. When combined with feature set I, it yields even better performance than feature set I alone; it further outperforms the baseline results on the classification task and stays on par with the baseline results on the regression task.
As described in Sect. 19.3.3, in a real-world application the overlap must be reliably estimated from the speech signal itself. To this end, we used a BLSTM classifier to generate frame-wise predictions of overlapping speech. We trained our network on the frame-wise feature set II (cf. Sect. 19.5); it contained one recurrent hidden layer consisting of 50 BLSTM memory blocks. We trained this network using the backpropagation through time (BPTT) algorithm with a learning rate of 10^-5 and a momentum of 0.9. We initialized the weights by sampling from a uniform distribution in the range [-0.1, 0.1] and further added noise sampled from a zero-mean Gaussian distribution with standard deviation σ = 0.1 to the network inputs. Training was run until the cross-entropy (CE) error on the development set had not improved for 10 epochs.
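The sketch below illustrates this recipe in PyTorch; it is an approximation under stated assumptions, not the original implementation. The layer size (50 memory blocks, here taken as 50 units per direction), the SGD hyper-parameters, the uniform weight initialisation, the Gaussian input noise, and the early-stopping criterion follow the description above, while the data handling and the overlap_ratio helper (which anticipates the clip-level computation described in the next paragraph) are hypothetical.

```python
# Illustrative PyTorch sketch of the frame-wise overlap detector; layer sizes
# and hyper-parameters follow the text, data handling is hypothetical.
import torch
import torch.nn as nn

class OverlapBLSTM(nn.Module):
    def __init__(self, n_features, n_hidden=50, noise_std=0.1):
        super().__init__()
        # One recurrent hidden layer; 50 memory blocks per direction are assumed.
        self.blstm = nn.LSTM(n_features, n_hidden, bidirectional=True,
                             batch_first=True)
        self.out = nn.Linear(2 * n_hidden, 2)    # overlap vs. non-overlap
        self.noise_std = noise_std
        # Uniform weight initialisation in [-0.1, 0.1], as described above.
        for p in self.parameters():
            nn.init.uniform_(p, -0.1, 0.1)

    def forward(self, x):                        # x: (batch, frames, features)
        if self.training:                        # Gaussian input noise, sigma = 0.1
            x = x + self.noise_std * torch.randn_like(x)
        h, _ = self.blstm(x)
        return self.out(h)                       # frame-wise logits

def overlap_ratio(model, clip):
    """Fraction of a clip's frames classified as overlapping (class index 1)."""
    model.eval()
    with torch.no_grad():
        pred = model(clip.unsqueeze(0)).argmax(dim=-1)
    return pred.float().mean().item()

def train(model, train_loader, devel_loader, patience=10):
    """Stop once development-set cross-entropy has not improved for 10 epochs."""
    opt = torch.optim.SGD(model.parameters(), lr=1e-5, momentum=0.9)
    ce = nn.CrossEntropyLoss()
    best, wait = float("inf"), 0
    while wait < patience:
        model.train()
        for x, y in train_loader:                # y: frame-wise 0/1 labels (long)
            opt.zero_grad()
            loss = ce(model(x).flatten(0, 1), y.flatten())
            loss.backward()                      # BPTT over the whole sequence
            opt.step()
        model.eval()
        with torch.no_grad():
            dev = sum(ce(model(x).flatten(0, 1), y.flatten()).item()
                      for x, y in devel_loader) / len(devel_loader)
        best, wait = (dev, 0) if dev < best else (best, wait + 1)
    return model
```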
We then used the trained BLSTM network to classify each frame as either overlapping or non-overlapping speech and computed the predicted overlap ratio for the full audio clip from the resulting frame-wise classifications. Using this predicted overlap ratio either as a single feature or alongside the baseline feature set I, just as before, we obtained the results shown in Table 19.7. Using the predicted overlap ratio as a single feature, a DNN with two hidden layers, each containing 32 ReLUs
 