Digital Signal Processing Reference
10.4.3.1 aGender Corpus
For the recording of the aGender corpus, an external company was employed to
identify possible speakers of the targeted age and gender groups [75, 179]. The
subjects received written instructions on the procedure and a financial reward; the
calls were free of charge. They were asked to call the recording system six times with
a mobile phone, alternating between indoor and outdoor locations to obtain different
recording environments. They were prompted by an automated interactive voice response
system to repeat given utterances or produce free content. A break of one day was
scheduled between sessions to ensure more variation in the voices. The utterances
were stored on the application server as 8 bit, 8 kHz, A-law. To validate the data, the
associated age cluster was compared with a manual transcription of the self-stated
date of birth.
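The storage format mentioned above (8 bit, 8 kHz, A-law) implies 8000 one-byte samples per second, i.e. 64 kbit/s. As a minimal sketch, assuming the utterances are available as raw, headerless G.711 A-law bytes (the container actually used on the application server is not stated here), they could be expanded to linear PCM as follows. The function names are illustrative only and are not part of the corpus tooling.

```python
from typing import List

SAMPLE_RATE_HZ = 8000  # from the text: 8 kHz, one 8-bit A-law code per sample


def alaw_to_linear(code: int) -> int:
    """Expand one 8-bit G.711 A-law code to a signed linear sample (16-bit scale)."""
    code ^= 0x55                      # undo the even-bit inversion applied by the encoder
    magnitude = (code & 0x0F) << 4    # 4-bit mantissa
    segment = (code & 0x70) >> 4      # 3-bit segment (exponent)
    if segment == 0:
        magnitude += 8
    else:
        magnitude += 0x108
        magnitude <<= segment - 1
    # In A-law, a set sign bit (after the XOR) marks a positive sample.
    return magnitude if code & 0x80 else -magnitude


def decode_alaw(data: bytes) -> List[int]:
    """Decode raw A-law bytes (assumed headerless) to a list of linear samples."""
    return [alaw_to_linear(b) for b in data]


def duration_seconds(data: bytes) -> float:
    """Duration of a raw A-law buffer: one byte per sample at 8 kHz."""
    return len(data) / SAMPLE_RATE_HZ
```

Under these assumptions, an utterance of the mean length of 2.58 s would occupy about 20 640 bytes.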
Four age groups—Child (C), Youth (Y), Adult (A), and Senior (S)—were defined.
Since children are not subdivided into female and male, this results in seven classes
as shown in Table 10.22.
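The seven-class scheme can be sketched as a small lookup, using the age ranges and class numbers listed in Table 10.22. The function name `agender_class` and the single-letter gender codes as arguments are assumptions made for illustration; ages outside 7-80 are rejected because the table defines no class for them.

```python
# Age ranges per group as listed in Table 10.22 (inclusive bounds, in years).
AGE_GROUPS = [
    ("Child", 7, 14),
    ("Youth", 15, 24),
    ("Adult", 25, 54),
    ("Senior", 55, 80),
]

# First class number of each gendered group (female first, then male).
FIRST_CLASS = {"Youth": 2, "Adult": 4, "Senior": 6}


def agender_class(age: int, gender: str) -> int:
    """Map an age and a gender code ('f' or 'm') to a class number 1..7.

    Hypothetical helper: children form a single class regardless of gender.
    """
    for group, low, high in AGE_GROUPS:
        if low <= age <= high:
            break
    else:
        raise ValueError(f"age {age} is outside the ranges of Table 10.22")

    if group == "Child":
        return 1
    if gender not in ("f", "m"):
        raise ValueError("gender must be 'f' or 'm' for non-child speakers")
    return FIRST_CLASS[group] + (0 if gender == "f" else 1)
```

For example, `agender_class(20, "m")` yields class 3 (Youth, male), while any speaker aged 7-14 maps to class 1.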
The content of the database was designed in the style of the SpeechDat corpora.
Each of the six recording sessions contains 18 utterances taken from a set of
utterances listed in detail in [194]. The topics of these were command words, embedded
commands, month, week day, relative time description, public holiday, birth date,
time, date, telephone number, postal code, first name, last name, and yes/no, each with
a corresponding free or pre-set inventory and a corresponding 'eliciting' question such
as "Please tell us any date, for example the birthday of a family member".
In total, 47 h of speech in 65 364 single utterances of 954 speakers were
collected. Note that not all volunteers completed all six calls, and some called
more often than six times, resulting in different numbers of utterances per
speaker. The mean utterance length was 2.58 s. For each of the seven classes, 25
speakers were selected randomly as a fixed Test partition (17 332 utterances, 12.45 h),
and the other 770 speakers formed a Training partition (53 076 utterances, 38.16 h),
which was further subdivided into Train (32 527 utterances in 23.43 h
of speech of 471 speakers) and Develop (20 549 utterances in 14.73 h of speech of
Table 10.22 Age and gender classes of the aGender corpus, where f and m abbreviate female and
male, and x represents children without gender discrimination. The last two columns represent the
number of speakers/instances per partition (Train and Develop)

Class  Group   Age    Gender  # Train     # Develop
1      Child   7-14   x       68 / 4 406  38 / 2 396
2      Youth   15-24  f       63 / 4 638  36 / 2 722
3      Youth   15-24  m       55 / 4 019  33 / 2 170
4      Adult   25-54  f       69 / 4 573  44 / 3 361
5      Adult   25-54  m       66 / 4 417  41 / 2 512
6      Senior  55-80  f       72 / 4 924  51 / 3 561
7      Senior  55-80  m       78 / 5 549  56 / 3 826
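
As a quick consistency check on Table 10.22 and the figures quoted in the text, the per-class counts can be summed and the total duration estimated from the mean utterance length. This is only a sketch over the values as printed here; the variable names are illustrative.

```python
# Rows of Table 10.22 as printed:
# (class, train_speakers, train_instances, develop_speakers, develop_instances)
TABLE_10_22 = [
    (1, 68, 4406, 38, 2396),
    (2, 63, 4638, 36, 2722),
    (3, 55, 4019, 33, 2170),
    (4, 69, 4573, 44, 3361),
    (5, 66, 4417, 41, 2512),
    (6, 72, 4924, 51, 3561),
    (7, 78, 5549, 56, 3826),
]

train_speakers = sum(row[1] for row in TABLE_10_22)
train_instances = sum(row[2] for row in TABLE_10_22)
develop_speakers = sum(row[3] for row in TABLE_10_22)
develop_instances = sum(row[4] for row in TABLE_10_22)

# Total duration implied by the quoted utterance count and mean length.
TOTAL_UTTERANCES = 65364
MEAN_UTTERANCE_S = 2.58
total_hours = TOTAL_UTTERANCES * MEAN_UTTERANCE_S / 3600

print("Train:  ", train_speakers, "speakers,", train_instances, "instances")
print("Develop:", develop_speakers, "speakers,", develop_instances, "instances")
print("Estimated total duration: %.2f h" % total_hours)
```

With the values as printed, the speaker columns sum to 471 (Train) and 299 (Develop), which together give the 770 Training speakers. The instance columns sum to 32 526 and 20 548, one fewer in each case than the 32 527 and 20 549 quoted in the text. Multiplying 65 364 utterances by the mean length of 2.58 s gives roughly 46.8 h, in line with the stated 47 h.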
 
 