Table 11.28 Beat-wise BLSTM-RNN classification on the UltraStar test set for the 2- and 3-class tasks. All values in %; for each pre-processing condition, UA is followed by WA.

Task    Classes     –              HE             LV             LV-HE          HE-LV
                    UA     WA      UA     WA      UA     WA      UA     WA      UA     WA
Voice   0/1         74.55  74.50   73.82  73.84   75.77  75.81   75.40  75.41   75.09  75.11
Gender  0/m/f       63.75  68.54   65.65  68.91   69.29  71.31   67.90  70.41   68.52  70.44
        m/f         86.67  91.09   88.45  91.91   86.93  91.12   89.61  93.60   87.76  92.50
Race    0/w/b+h+a   48.17  63.84   47.46  63.02   49.37  65.46   49.23  63.63   48.40  63.77
        w/b+h+a     60.44  65.82   63.30  76.98   55.05  76.18   62.57  78.67   62.78  75.16
Age     0/y/o       51.02  57.61   50.00  57.14   53.50  59.85   51.26  58.86   50.01  57.72
        y/o         55.30  55.60   57.55  56.56   53.93  53.63   55.97  54.89   54.69  54.17
Height  0/s/t       53.94  66.79   52.35  66.57   58.15  69.30   57.67  68.41   58.91  69.53
        s/t         64.70  72.73   62.31  70.67   66.54  73.00   69.65  77.49   72.07  78.26

Pre-processing: harmonic enhancement by drum-beat separation (HE), leading voice extraction (LV), and the two sequential combinations of these (LV-HE, HE-LV).
11.8.3 Performance
Supervised training of the networks followed a random initialisation of the network weights from a Gaussian distribution with zero mean and a standard deviation of 0.1. For improved generalisation, the order of the input sequences was randomised, and Gaussian noise with zero mean and a standard deviation of 0.3 was added to the input activations. Resilient propagation was used for the iterative update of the network's weights during training. Training was stopped once no improvement over 20 epochs had been observed on the validation set. To cope with the race task's high class imbalance, a fixed number of 20 epochs was run to avoid overfitting to the validation set, and the standard deviation of the Gaussian noise added to the input activations was increased to σ = 0.9.
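The initialisation and regularisation recipe above can be sketched as follows. This is a minimal illustration using only Python's standard library; the BLSTM network itself, the resilient-propagation update, and the validation scoring are stood in by placeholders and are not part of the original text:

```python
import random

def init_weights(n, std=0.1):
    """Gaussian initialisation: zero mean, standard deviation 0.1."""
    return [random.gauss(0.0, std) for _ in range(n)]

def add_input_noise(activations, std=0.3):
    """Gaussian noise (zero mean, std 0.3) added to the input activations."""
    return [a + random.gauss(0.0, std) for a in activations]

def train(sequences, validate, max_epochs=1000, patience=20):
    """Training loop with early stopping: halt once `patience` epochs
    pass without improvement on the validation set.  `validate` is a
    placeholder returning a validation score for the current epoch."""
    best, since_best = float("-inf"), 0
    for epoch in range(max_epochs):
        random.shuffle(sequences)            # randomise input sequence order
        for seq in sequences:
            noisy = add_input_noise(seq)     # noise on the input activations
            # ... resilient propagation weight update would go here ...
        score = validate(epoch)
        if score > best:
            best, since_best = score, 0
        else:
            since_best += 1
        if since_best >= patience:           # no improvement over 20 epochs
            break
    return best
```

For the race task, the text's variant corresponds to calling `add_input_noise(seq, std=0.9)` and running a fixed 20 epochs instead of early stopping.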
The general imbalance of instances across classes and tasks on the beat level
(cf. Table 11.27) renders UA the major performance measure of interest. Singer
presence detection reaches over 75 % UA with the use of leading voice extraction.
On the 2-class gender recognition task, the combination of source separation
algorithms leads to the best result with drum-beat separation as the last step,
at 89.61 % UA. For height recognition, the combination of the pre-processing
steps, albeit in inverse order, also leads to optimal results: 72.07 % UA is
reached, increasing UA by more than 7 % absolute compared to no pre-processing.
On the 3-class task, the best UA is 69.29 % when using exclusively the isolation
of the singing voice. With the same pre-processing, 2-class recognition of race
and age is solved best at 63.30 % and 57.55 % UA. Age recognition falls behind
the results on spoken language (cf. Sect. 10.4.3), but the result is significantly
above the chance level of 50 % UA according to a z-test (p < 0.001).
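The distinction between the two accuracy measures is central here: weighted accuracy (WA) is the overall hit rate and is inflated by the majority class, whereas unweighted accuracy (UA) averages the per-class recalls and is therefore the fairer measure under imbalance. A minimal sketch (the function name and example data are illustrative, not from the original):

```python
from collections import defaultdict

def wa_ua(y_true, y_pred):
    """Weighted accuracy (overall hit rate) vs. unweighted accuracy
    (mean of per-class recalls), as used for imbalanced tasks."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    wa = sum(correct.values()) / len(y_true)
    ua = sum(correct[c] / total[c] for c in total) / len(total)
    return wa, ua

# Skewed 2-class example: 8 'w' vs 2 'b' instances,
# with one minority instance misclassified.
truth = ["w"] * 8 + ["b"] * 2
pred  = ["w"] * 8 + ["w", "b"]
```

Here WA is 90 % but UA only 75 %, mirroring the gap between the two columns on the imbalanced race and age tasks in Table 11.28.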
To evaluate semantic singer trait tagging of entire songs, Table 11.29 shows the
accuracies of a majority vote on the beat level, compared against the most frequent
ground truth class on the beat level. Obviously, such a gold standard is of a 'more
heuristic nature' given phenomena such as mixed-gender duets. On the song level, gender