Digital Signal Processing Reference
In-Depth Information
Fig. 10.6 Age (in years) histograms for the train and develop partitions of aGender [75].
Panels: "aGender Train: # speakers (age)" and "aGender Develop: # speakers (age)".
Horizontal axis: age in years (ticks 5 to 80); vertical axis: number of speakers (ticks 0 to 20).
299 speakers) partitions. Overall, this random speaker-based partitioning results in
a roughly 40/30/30 % Train/Develop/Test distribution. Table 10.22 lists the number of
speakers and the number of utterances per class in the Train and Develop partitions,
and Fig. 10.6 depicts the number of speakers as a histogram over their age.
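A random speaker-based partitioning of this kind can be sketched as follows. This is a minimal illustration, not the procedure actually used to partition aGender: the function names `split_speakers` and `partition_utterances`, the fixed random seed, and the rounding of partition sizes are all assumptions made for the example. The point it shows is that speakers, not utterances, are shuffled and split, so that no speaker appears in more than one partition.

```python
import random


def split_speakers(speaker_ids, fractions=(0.4, 0.3, 0.3), seed=0):
    """Randomly assign each speaker to exactly one partition.

    The fractions (Train, Develop, Test) follow the roughly 40/30/30 %
    distribution named in the text; the seed is a hypothetical choice.
    """
    speakers = sorted(set(speaker_ids))  # unique speakers, stable order
    rng = random.Random(seed)
    rng.shuffle(speakers)

    n = len(speakers)
    n_train = round(fractions[0] * n)
    n_dev = round(fractions[1] * n)
    return {
        "train": set(speakers[:n_train]),
        "develop": set(speakers[n_train:n_train + n_dev]),
        "test": set(speakers[n_train + n_dev:]),  # remainder
    }


def partition_utterances(utterances, split):
    """Route (utterance_id, speaker_id) pairs to their speaker's partition."""
    parts = {name: [] for name in split}
    for utt_id, speaker in utterances:
        for name, speaker_set in split.items():
            if speaker in speaker_set:
                parts[name].append(utt_id)
                break
    return parts


# Toy data: 10 hypothetical speakers with 5 utterances each.
utterances = [(f"utt{i}", f"spk{i % 10}") for i in range(50)]
split = split_speakers(speaker for _, speaker in utterances)
parts = partition_utterances(utterances, split)
```

Because the split is made on speakers, the utterance-level percentages only approximate 40/30/30 when speakers contribute different numbers of utterances.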
The age group can be handled either as a combined age/gender task by classes
{1, ..., 7} as indicated in Table 10.22 or as an age group task independent of gender
by classes {C, Y, A, S}. For comparison of results though, only the age group
information is used by mapping {1, ..., 7} → {C, Y, A, S} as denoted. For gender, the
classes {f, m, x} have to be classified, as gender discrimination of children is
considerably difficult, yet it was again decided to keep all instances (cf. Sect. 10.4.2) for
both tasks.
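The mapping from the seven combined classes to the age group and gender labels can be written as a simple lookup. Table 10.22 is not reproduced in this section, so the assignment of the numbers 1 to 7 below is an assumption: it supposes that class 1 holds the children (gender x) and that classes 2 to 7 alternate female/male for the three remaining age groups. The names `AGE_GENDER_CLASSES`, `to_age_group`, and `to_gender` are hypothetical.

```python
# ASSUMED class layout; check against Table 10.22 before relying on it.
# Each combined class maps to (age group, gender).
AGE_GENDER_CLASSES = {
    1: ("C", "x"),  # children, gender not discriminated
    2: ("Y", "f"),
    3: ("Y", "m"),
    4: ("A", "f"),
    5: ("A", "m"),
    6: ("S", "f"),
    7: ("S", "m"),
}


def to_age_group(label):
    """Map a combined class in {1, ..., 7} to an age group in {C, Y, A, S}."""
    return AGE_GENDER_CLASSES[label][0]


def to_gender(label):
    """Map a combined class in {1, ..., 7} to a gender class in {f, m, x}."""
    return AGE_GENDER_CLASSES[label][1]


# Map a few combined-class labels to the two separate tasks.
labels = [1, 2, 5, 7]
age_groups = [to_age_group(c) for c in labels]
genders = [to_gender(c) for c in labels]
```

Keeping the children as a separate gender class x, rather than dropping them, mirrors the decision in the text to keep all instances for both tasks.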
10.4.3.2 TIMIT Database
The TIMIT corpus [195] is well suited for height determination experiments in the
sense that it contains a sufficiently high number of speakers (630 in total). This is
needed when it comes to speaker trait assessment in order to obtain meaningful and
statistically significant results. Each speaker spoke ten phonetically rich sentences.
The fact that these speakers pronounce the same sentences renders the paralinguistic
task somewhat text dependent. This also holds for several other databases, e.g., partly
the aGender corpus above and, in the field of emotion and affective speaker state
recognition, the Berlin, the Danish, and the eNTERFACE emotional speech databases or
the Speech Under Simulated and Actual Stress database, which show a higher limitation
in phonetic content variation [161]. As stated, in addition to featuring a sufficient
number of different speakers, TIMIT provides a rich amount of meta-information on its
speakers' traits: their age, gender, height, dialect (one of 8 major American English
ones), highest education degree, and race. All TIMIT recordings are in 16 bit, 16 kHz.
Figure 10.7 depicts the distribution of height for the speakers in TIMIT. The
non-continuous distribution of height in the histogram is because of TIMIT originally
 
 