Table 19.6 Results for the oracle overlap ratio as a single feature (top) and alongside baseline feature set I (bottom), varying the number of hidden layers on the classification (class) and regression task (score)

[%]                               Class            Score
                                  devel    test    devel    test
MLP (1-32-1)                      75.5     76.7    78.6     79.1
DNN (1-32-32-1)                   76.4     77.2    79.2     79.6
DNN (1-32-32-32-1)                75.4     76.7    78.9     79.2
MLP (6374-2048-1)                 80.6     81.8    81.3     81.8
DNN (6374-1024-1024-1)            81.4     82.5    81.9     82.5
DNN (6374-1024-1024-1024-1)       80.5     81.9    81.3     81.7

Shown are the best results obtained on the development set (devel) and on the test set (test). The percentages reported denote UAR for the classification task and CC for the regression task.
In order to see whether the overlap ratio adds information beyond the baseline features, we also trained a series of networks with the overlap ratio appended to feature set I, varying the number of hidden units from 64 to 4,096.
Table 19.6 shows the results for the respective optimal number of hidden units, H_hid = 32 for the oracle overlap ratio as a single feature and H_hid = 1,024 for the oracle overlap ratio alongside feature set I, trained with different numbers of hidden layers.
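As an illustration of this setup, the following PyTorch sketch builds the best-performing topology from the lower half of Table 19.6, DNN (6374-1024-1024-1), where the 6,374 inputs correspond to feature set I plus the appended overlap ratio. This is a hypothetical reconstruction rather than the authors' implementation; the use of ReLU hidden units and a sigmoid or linear output for the classification and regression tasks is assumed.

```python
# Hypothetical sketch (PyTorch) of the DNN (6374-1024-1024-1) topology from
# Table 19.6: feature set I plus the oracle overlap ratio as input, two hidden
# layers of 1,024 units each, and a single output unit.
import torch.nn as nn

def build_dnn(n_inputs=6374, n_hidden=1024, n_layers=2, classification=True):
    """Feed-forward network; ReLU hidden units are assumed here."""
    layers, width = [], n_inputs
    for _ in range(n_layers):
        layers += [nn.Linear(width, n_hidden), nn.ReLU()]
        width = n_hidden
    layers.append(nn.Linear(width, 1))
    # Sigmoid output for the binary conflict classification task,
    # linear output for regressing the conflict score.
    if classification:
        layers.append(nn.Sigmoid())
    return nn.Sequential(*layers)

# Classification (UAR) and regression (CC) variants share the topology.
clf = build_dnn(classification=True)
reg = build_dnn(classification=False)
```

The same builder covers the smaller topologies from the upper half of the table (overlap ratio as a single feature) by setting n_inputs=1 and n_hidden=32.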
Even when used alone, the oracle overlap ratio predicts the conflict class and the level of conflict with high accuracy. When combined with feature set I, it yields even better performance than feature set I alone; it further outperforms the baseline results on the classification task and stays on par with the baseline results on the regression task.
As described in Sect. 19.3.3, in a real-world application the overlap must be reliably estimated from the speech signal itself. To this end, we used a BLSTM classifier to generate frame-wise predictions of overlapping speech. We trained our network on the frame-wise feature set II (cf. Sect. 19.5); it contained one recurrent hidden layer consisting of 50 BLSTM memory blocks. We trained this network using the backpropagation through time (BPTT) algorithm with a learning rate of 10^-5 and a momentum of 0.9. We initialized the weights by sampling from a uniform distribution in the range [-0.1, 0.1] and further added noise sampled from a zero-mean Gaussian distribution with standard deviation σ = 0.1 to the network inputs. Training was run until the cross-entropy (CE) error on the development set had not improved for 10 epochs.
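The sketch below illustrates this recipe in PyTorch; it is an approximation under stated assumptions, not the original implementation. The layer size (50 memory blocks, here taken as 50 units per direction), the SGD hyper-parameters, the uniform weight initialisation, the Gaussian input noise, and the early-stopping criterion follow the description above, while the data handling and the overlap_ratio helper (which anticipates the clip-level computation described in the next paragraph) are hypothetical.

```python
# Illustrative PyTorch sketch of the frame-wise overlap detector; layer sizes
# and hyper-parameters follow the text, data handling is hypothetical.
import torch
import torch.nn as nn

class OverlapBLSTM(nn.Module):
    def __init__(self, n_features, n_hidden=50, noise_std=0.1):
        super().__init__()
        # One recurrent hidden layer; 50 memory blocks per direction are assumed.
        self.blstm = nn.LSTM(n_features, n_hidden, bidirectional=True,
                             batch_first=True)
        self.out = nn.Linear(2 * n_hidden, 2)    # overlap vs. non-overlap
        self.noise_std = noise_std
        # Uniform weight initialisation in [-0.1, 0.1], as described above.
        for p in self.parameters():
            nn.init.uniform_(p, -0.1, 0.1)

    def forward(self, x):                        # x: (batch, frames, features)
        if self.training:                        # Gaussian input noise, sigma = 0.1
            x = x + self.noise_std * torch.randn_like(x)
        h, _ = self.blstm(x)
        return self.out(h)                       # frame-wise logits

def overlap_ratio(model, clip):
    """Fraction of a clip's frames classified as overlapping (class index 1)."""
    model.eval()
    with torch.no_grad():
        pred = model(clip.unsqueeze(0)).argmax(dim=-1)
    return pred.float().mean().item()

def train(model, train_loader, devel_loader, patience=10):
    """Stop once development-set cross-entropy has not improved for 10 epochs."""
    opt = torch.optim.SGD(model.parameters(), lr=1e-5, momentum=0.9)
    ce = nn.CrossEntropyLoss()
    best, wait = float("inf"), 0
    while wait < patience:
        model.train()
        for x, y in train_loader:                # y: frame-wise 0/1 labels (long)
            opt.zero_grad()
            loss = ce(model(x).flatten(0, 1), y.flatten())
            loss.backward()                      # BPTT over the whole sequence
            opt.step()
        model.eval()
        with torch.no_grad():
            dev = sum(ce(model(x).flatten(0, 1), y.flatten()).item()
                      for x, y in devel_loader) / len(devel_loader)
        best, wait = (dev, 0) if dev < best else (best, wait + 1)
    return model
```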
We then used the trained BLSTM network to classify each frame as either overlapping or non-overlapping speech and computed the predicted overlap ratio for the full audio clip from the resulting frame-wise classifications. Using this predicted overlap ratio either as a single feature or alongside the baseline feature set I, just as before, we obtained the results shown in Table 19.7. Using the predicted overlap ratio as a single feature, a DNN with two hidden layers, each containing 32 ReLUs
 