Applications in Intelligent Speech Analysis - Intelligent Audio Analysis

10.4.3.1 aGender Corpus

For the recording of the aGender corpus, an external company was employed to identify possible speakers of the targeted age and gender groups [75, 179]. The subjects received written instructions on the procedure and a financial reward; the calls were free of charge. They were asked to call the recording system six times with a mobile phone, alternating between indoor and outdoor locations to obtain different recording environments. An automated interactive voice response system prompted them to repeat given utterances or to produce free content. A break of one day was scheduled between sessions to ensure more variation in the voices. The utterances were stored on the application server as 8 bit, 8 kHz, A-law. To validate the data, the associated age cluster was compared with a manual transcription of the self-stated date of birth.
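As a rough illustration of the storage format, the sketch below expands 8 bit A-law codes to linear PCM following the standard G.711 expansion rule. It assumes the samples are available as a raw byte string; how the corpus files are actually packaged (headers, container) is not stated here, and the function names are hypothetical.

```python
def alaw_to_linear(code: int) -> int:
    """Expand one 8-bit G.711 A-law code to a signed PCM value (16-bit scale)."""
    code ^= 0x55                       # undo the even-bit inversion used by G.711
    positive = bool(code & 0x80)       # after inversion, a set top bit means positive
    segment = (code & 0x70) >> 4       # 3-bit segment (exponent)
    value = (code & 0x0F) << 4         # 4-bit mantissa, left-aligned
    if segment == 0:
        value += 8                     # linear segment: add half a step
    else:
        value = (value + 0x108) << (segment - 1)
    return value if positive else -value


def decode_alaw(payload: bytes) -> list[int]:
    """Decode a raw A-law byte string into a list of PCM samples."""
    return [alaw_to_linear(b) for b in payload]


def duration_seconds(payload: bytes, sample_rate: int = 8000) -> float:
    """One byte per sample, so duration is byte count divided by sample rate."""
    return len(payload) / sample_rate
```

At 8 kHz and one byte per sample, one second of speech occupies 8000 bytes, so the mean utterance length of a few seconds corresponds to a few tens of kilobytes per utterance.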

Four age groups were defined: Child (C), Youth (Y), Adult (A), and Senior (S). Since children are not subdivided into female and male, this results in seven classes, as shown in Table 10.22.
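The class scheme can be written down as a small lookup. The following is a minimal sketch that uses the age ranges and class numbers of Table 10.22; the function name `agender_class` and the error handling are illustrative assumptions, not part of the corpus tooling.

```python
# Age ranges (inclusive) per group, taken from Table 10.22.
AGE_GROUPS = [
    ("Child", 7, 14),
    ("Youth", 15, 24),
    ("Adult", 25, 54),
    ("Senior", 55, 80),
]

# Class numbers from Table 10.22; "x" marks children (no gender split).
CLASS_IDS = {
    ("Child", "x"): 1,
    ("Youth", "f"): 2,
    ("Youth", "m"): 3,
    ("Adult", "f"): 4,
    ("Adult", "m"): 5,
    ("Senior", "f"): 6,
    ("Senior", "m"): 7,
}


def agender_class(age: int, gender: str) -> int:
    """Map an age in years and a gender ('f' or 'm') to one of the seven classes."""
    for group, low, high in AGE_GROUPS:
        if low <= age <= high:
            if group == "Child":
                return CLASS_IDS[(group, "x")]   # gender is ignored for children
            if gender not in ("f", "m"):
                raise ValueError(f"unknown gender label: {gender!r}")
            return CLASS_IDS[(group, gender)]
    raise ValueError(f"age {age} lies outside the ranges covered by the corpus")
```

For example, `agender_class(20, "m")` yields class 3 (Youth, male), while any age from 7 to 14 yields class 1 regardless of the gender label.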

The content of the database was designed in the style of the SpeechDat corpora. Each of the six recording sessions contains 18 utterances taken from a set of utterances listed in detail in [194]. The topics of these were command words, embedded commands, month, week day, relative time description, public holiday, birth date, time, date, telephone number, postal code, first name, last name, and yes/no, with a corresponding free or pre-set inventory and corresponding 'eliciting' questions such as “Please tell us any date, for example the birthday of a family member”.

In total, 47 h of speech in 65 364 single utterances of 954 speakers were collected. Note that not all volunteers completed all six calls, and some called more often than six times, resulting in different numbers of utterances per speaker. The mean utterance length was 2.58 s. For each of the seven classes, 25 speakers were selected randomly as a fixed Test partition (17 332 utterances, 12.45 h), and the other 770 speakers form a Training partition (53 076 utterances, 38.16 h), which was further subdivided into Train (32 527 utterances in 23.43 h of speech of 471 speakers) and Develop (20 549 utterances in 14.73 h of speech of

Table 10.22 Age and gender classes of the aGender corpus, where f and m abbreviate female and male, and x represents children without gender discrimination. The last two columns give the number of speakers / instances per partition (Train and Develop).

| Class | Group  | Age   | Gender | # Train    | # Develop  |
|-------|--------|-------|--------|------------|------------|
| 1     | Child  | 7-14  | x      | 68 / 4 406 | 38 / 2 396 |
| 2     | Youth  | 15-24 | f      | 63 / 4 638 | 36 / 2 722 |
| 3     | Youth  | 15-24 | m      | 55 / 4 019 | 33 / 2 170 |
| 4     | Adult  | 25-54 | f      | 69 / 4 573 | 44 / 3 361 |
| 5     | Adult  | 25-54 | m      | 66 / 4 417 | 41 / 2 512 |
| 6     | Senior | 55-80 | f      | 72 / 4 924 | 51 / 3 561 |
| 7     | Senior | 55-80 | m      | 78 / 5 549 | 56 / 3 826 |
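The speaker counts in Table 10.22 can be tallied against the partition sizes quoted in the text. The sketch below copies the table into a dictionary (the variable and function names are illustrative assumptions) and sums the speaker columns. The Train column adds up to the 471 speakers stated in the text, and Train plus Develop gives the 770 speakers of the Training partition.

```python
# class id -> (group, gender, (train speakers, train instances),
#                             (develop speakers, develop instances))
TABLE_10_22 = {
    1: ("Child", "x", (68, 4406), (38, 2396)),
    2: ("Youth", "f", (63, 4638), (36, 2722)),
    3: ("Youth", "m", (55, 4019), (33, 2170)),
    4: ("Adult", "f", (69, 4573), (44, 3361)),
    5: ("Adult", "m", (66, 4417), (41, 2512)),
    6: ("Senior", "f", (72, 4924), (51, 3561)),
    7: ("Senior", "m", (78, 5549), (56, 3826)),
}

# Sum the speaker counts of each partition over all seven classes.
train_speakers = sum(row[2][0] for row in TABLE_10_22.values())
develop_speakers = sum(row[3][0] for row in TABLE_10_22.values())
training_speakers = train_speakers + develop_speakers


def utterances_per_speaker(class_id: int, partition: str = "train") -> float:
    """Mean number of instances per speaker for one class and partition."""
    column = {"train": 2, "develop": 3}[partition]
    speakers, instances = TABLE_10_22[class_id][column]
    return instances / speakers
```

The per-class ratio of instances to speakers makes the uneven number of utterances per speaker visible, which follows from volunteers completing fewer or more than six calls.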
