Table 11.29 Song-wise BLSTM-RNN predictions on the UltraStar test set by beat-wise majority
vote among 3-class tasks (ignoring beats classified as 0) or 2-class tasks

[%]     Vote on   -            HE           LV           LV-HE        HE-LV
Task              UA    WA     UA    WA     UA    WA     UA    WA     UA    WA
Gender  0/m/f     80.9  87.0   81.7  85.6   87.7  90.9   91.3  92.4   87.7  90.9
        m/f       86.9  90.1   89.0  90.9   87.7  90.9   89.6  93.9   89.6  93.9
Race    0/w/b+h+a 49.8  78.8   53.5  79.7   51.0  78.2   54.0  75.2   48.9  72.2
        w/b+h+a   52.8  59.8   62.6  75.9   54.7  73.7   64.4  78.9   61.7  74.4
Age     0/y/o     55.2  54.5   54.6  54.1   56.0  54.1   56.9  57.4   50.9  51.6
        y/o       54.5  54.5   57.0  55.7   52.2  51.6   53.4  52.5   58.9  58.2
Singer height is not included on this level due to sparseness: only 88 songs have a known ground
truth. Preprocessing steps are the same as for the results shown in Table 11.28
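As an illustration of the evaluation scheme behind Table 11.29, the following minimal sketch (Python; the function names and the toy data are hypothetical and not taken from the experiments) derives song-level labels by majority vote over per-beat predictions, ignoring the 'no voice' class 0 in the 3-class setting, and computes UA (here taken as the unweighted mean of per-class recalls) and WA (overall accuracy over all songs).

```python
from collections import Counter

def song_vote(beat_predictions, ignore_zero=True):
    """Majority vote over the per-beat class predictions of one song.

    In the 3-class setting the 'no voice' class 0 is ignored, so the
    song label is decided by the voiced beats only.
    """
    votes = [p for p in beat_predictions if not (ignore_zero and p == 0)]
    if not votes:                      # no voiced beat predicted at all
        return 0
    return Counter(votes).most_common(1)[0][0]

def ua_wa(gold, predicted):
    """Unweighted (mean per-class recall) and weighted (overall) accuracy."""
    classes = sorted(set(gold))
    recalls = []
    for c in classes:
        idx = [i for i, g in enumerate(gold) if g == c]
        recalls.append(sum(predicted[i] == c for i in idx) / len(idx))
    ua = sum(recalls) / len(recalls)
    wa = sum(p == g for p, g in zip(predicted, gold)) / len(gold)
    return ua, wa

# Hypothetical example: two songs with per-beat gender predictions (0 = no voice).
songs = {"song_a": [0, 0, 2, 2, 1, 2], "song_b": [0, 1, 1, 1, 2, 0]}
gold = {"song_a": 2, "song_b": 1}
pred = {s: song_vote(b) for s, b in songs.items()}
print(ua_wa([gold[s] for s in songs], [pred[s] for s in songs]))
```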
On the song level, gender recognition reaches 91.3 % UA, race recognition 64.4 % UA and age
recognition 58.9 % UA. Interestingly, for gender, voting on all beats is more robust than voting
exclusively over beats with voice presence. This might be explained by the fact that BLSTM-RNNs
model bi-directional context and thus consider neighbouring frames in their decisions, i.e., the
predictions for parts without vocals are influenced by the features of the vocal parts. Across the
tasks and settings, the combination of the two pre-processing steps, first leading voice extraction,
then drum-beat separation, gives the best results.
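To make the role of bi-directional context more tangible, the following minimal sketch of a frame-wise BLSTM classifier in PyTorch is purely illustrative; the layer sizes, feature dimensionality and names are assumptions, not the configuration used in these experiments. It only shows that the output at every frame is computed from hidden states that have read the sequence in both directions, so frames without vocals are still influenced by neighbouring vocal frames.

```python
import torch
import torch.nn as nn

class BeatBLSTM(nn.Module):
    """Minimal frame-wise BLSTM classifier (illustrative, not the original setup)."""

    def __init__(self, n_features=39, n_hidden=64, n_classes=3):
        super().__init__()
        # bidirectional=True: the output at frame t sees frames before and after t
        self.blstm = nn.LSTM(n_features, n_hidden, batch_first=True,
                             bidirectional=True)
        self.out = nn.Linear(2 * n_hidden, n_classes)   # forward + backward states

    def forward(self, x):               # x: (batch, frames, n_features)
        h, _ = self.blstm(x)            # h: (batch, frames, 2 * n_hidden)
        return self.out(h)              # per-frame class scores

# Example: one song of 120 frames with 39 features each.
model = BeatBLSTM()
scores = model(torch.randn(1, 120, 39))
print(scores.shape)                     # torch.Size([1, 120, 3])
```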
11.8.4 Summary
Fully automatic assessment of paralinguistic traits (age, height, and race) was demonstrated in
this section based on vocals in original pop music rather than on more or less clean speech as in
Sect. 10.4.3. Gender recognition was observed to give 'application-ready' results even on the beat
level for unseen test data. Race and height classification show general feasibility, even in such a
highly realistic setting. An interdependency of race and musical genre might exist; yet, the fact
that source separation generally improved performance can be seen as an indication that the
networks are at least partly capable of race recognition from the singing voice itself.
The quite good results for height classification certainly stem from the correlation with gender.
Age recognition results were lower than those reported on speech in Sect. 10.4.3: there, four age
classes rather than the two considered here were discriminated at roughly similar performance.
Besides the 'disturbance' by the music and the singing voice itself, the challenge may be owing to
the considered type of 'chart' music, where many singers are of a similar age. Using only males for
training and testing of age classification in an additional test run, however, led to 61.63 % UA;
female singers were too sparse in the set.
Next efforts could analyse the influence of longer units of analysis than the beat level, such as
the supra-segmental functionals as used for paralinguistic analysis in speech (cf. Sect. 10.4.3).
In that case, however, feature variation owing to the
 
 