Applications in Intelligent Music Analysis - Intelligent Audio Analysis - page 277

Fig. 11.20 UltraStar Singer Trait Database's distribution of traits among its 516 contained singers [36]. (a) Gender, (b) Race, (c) Age [years], (d) Height [cm]

boxes ranging from the first to the third quartile and values exceeding this range by more than a factor of 1.5 shown as outliers by circles. The fact that singer age is a function of a musical piece's recording date was taken into account.
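The box-plot convention mentioned above can be illustrated with a minimal sketch. It assumes the common reading of the rule, namely that a value is an outlier when it lies more than 1.5 times the interquartile range below the first or above the third quartile; the function name `iqr_outliers`, the quartile interpolation method, and the sample heights are illustrative assumptions and do not come from the database.

```python
import statistics

def iqr_outliers(values, factor=1.5):
    """Return (q1, q3, outliers) under the assumed 1.5 x IQR box-plot rule."""
    # 'inclusive' interpolates linearly between data points; other quartile
    # conventions exist and may give slightly different box edges.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return q1, q3, outliers

# Hypothetical singer heights in cm, for illustration only.
heights = [160, 165, 170, 175, 180, 185, 230]
q1, q3, outliers = iqr_outliers(heights)
print(q1, q3, outliers)
```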

For automatic assessment, the tasks were constrained to binary and ternary classification on frame (beat) level as well as on song level. This restriction was necessary owing to the challenging real-world conditions of assessing singer traits in polyphonic music. Binary classification provides a simple categorisation per singer trait, whereas ternary classification additionally performs singing activity detection on frame level in order to provide full realism. Height was discretised into 'small' (s, < 175 cm) and 'tall' (t, ≥ 175 cm), and age into 'young' (y, < 30 years) and 'old' (o, ≥ 30 years). Among the annotated race classes, the sparse classes 'Asian', 'Black', and 'Hispanic' were clustered into one class as opposed to 'White' singers.

The numbers of beats used for task evaluation are shown in Table 11.27. The annotation is available for reproduction of results (see footnote 13).
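The discretisation described above can be written down as a small sketch. The thresholds (175 cm, 30 years) and the class letters s, t, y, o come from the text; the function names, the label 'other' for the clustered race classes, and the label 'n' for beats without singing in the ternary task are hypothetical placeholders chosen here for illustration.

```python
def height_class(height_cm):
    """'s' (small) below 175 cm, 't' (tall) at 175 cm or above."""
    return "s" if height_cm < 175 else "t"

def age_class(age_years):
    """'y' (young) below 30 years, 'o' (old) at 30 years or above."""
    return "y" if age_years < 30 else "o"

def race_class(race):
    """Keep 'White'; cluster the sparse classes under a hypothetical 'other' label."""
    return "White" if race == "White" else "other"

def ternary_label(trait_label, is_singing):
    """Frame-level ternary label: the trait class while singing,
    otherwise a hypothetical 'n' (no singing) class."""
    return trait_label if is_singing else "n"

# Example: one beat of a 181 cm singer, with and without vocal activity.
print(ternary_label(height_class(181), True))
print(ternary_label(height_class(181), False))
```

Since the age of a singer depends on the recording date of the piece, the value passed to `age_class` would need to be the age at recording time rather than the current age.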

11.8.2 Methodology

Given the challenging conditions of recognising person traits from singing in polyphonic music, finding the optimal preprocessing by suitable singer separation becomes a central issue. To this end, harmonic enhancement as shown in Sect. 11.1, based on openBliSSART (cf. Sect. 11.8), is used as a first means. This is now followed by targeted extraction of the leading voice as in [170]. Different sets of NMF components are used in different parts of a song to make the algorithm more flexible. A song is therefore split into frame-synchronous, non-overlapping chunks of 881 664 samples (≈ 20 s at 44.1 kHz sample rate) as in [35]. Then, the leading voice

13 http://www.openaudio.eu/UltraStar_Singers.arff
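The chunking step described in Sect. 11.8.2 can be sketched as follows. The chunk length of 881 664 samples and the 44.1 kHz sample rate are taken from the text; the function name `chunk_signal` and the handling of a shorter final chunk are assumptions made for this illustration. Note that 881 664 = 861 × 1024, which would be consistent with chunks aligned to a 1024-sample frame grid, although the text does not state the frame parameters.

```python
CHUNK_SAMPLES = 881_664   # chunk length given in the text
SAMPLE_RATE = 44_100      # Hz, as given in the text

def chunk_signal(samples, chunk_len=CHUNK_SAMPLES):
    """Split a sequence of samples into non-overlapping chunks.

    The last chunk is kept even if it is shorter than chunk_len
    (an assumption; the text does not say how the remainder is handled).
    """
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    return [samples[start:start + chunk_len]
            for start in range(0, len(samples), chunk_len)]

# Duration of one full chunk in seconds: roughly 19.99 s, i.e. about 20 s.
print(CHUNK_SAMPLES / SAMPLE_RATE)

# Small demonstration with a toy chunk length.
print(chunk_signal(list(range(10)), chunk_len=4))
```

Each chunk could then be processed on its own, for example with a separate set of NMF components per chunk.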
