Digital Signal Processing Reference
biological trait primitives such as height [76, 178], weight, age [75, 179], gender [75, 179];
group/ethnicity membership: race/culture/social class with a weak borderline towards other linguistic concepts, i.e., speech registers such as dialect or nativeness [180];
personality traits: likability [181, 182];
personality in general, 'Big Five' personality traits (openness, conscientiousness, extroversion, agreeableness, and neuroticism) [183-185];
speaker idiosyncrasies, i.e., speaker-ID [186].
As examples, the traits age and gender, as featured in the INTERSPEECH 2010
Paralinguistic Challenge, and additionally speaker height are discussed in the
following. For age and gender, mostly either prosodic supra-segmental features
or frame-level features based on MFCCs have been employed, as well as their optimal
fusion [187]. For speaker height, very little research has been carried out so far
[178, 188]. The authors in [188] examined the ability of listeners to determine the
speaker's height and weight from speech samples and found that, especially for male
speakers, listeners are able to estimate a speaker's height and weight to a certain
degree. A similar study is documented in [189] and deals with the assignment of
photographs to voices as well as the estimation of a speaker's age, height, and
weight from speech samples. The relationship between formant frequencies and body
size was examined in [190]. Especially for female participants, a significant
correlation between formant parameters and height was found. Another study revealed
significant negative correlations between F0, formant dispersion and body shape and
weight of male speakers [191].
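To make the kind of analysis behind such correlation findings concrete, the following is a minimal sketch in Python. It assumes the common definition of formant dispersion as the average spacing between adjacent formant frequencies; the cited studies may have computed it differently. The function names and all numeric values are hypothetical toy data constructed for illustration, not measurements from any of the cited work.

```python
import math


def formant_dispersion(formants_hz):
    """Average spacing between adjacent formants: (F_N - F_1) / (N - 1)."""
    if len(formants_hz) < 2:
        raise ValueError("need at least two formant frequencies")
    ordered = sorted(formants_hz)
    return (ordered[-1] - ordered[0]) / (len(ordered) - 1)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Hypothetical toy speakers (assumption): constructed so that taller speakers
# have more closely spaced formants. These are NOT data from a cited study.
speakers = [
    {"height_cm": 165.0, "formants_hz": [540.0, 1620.0, 2700.0, 3780.0]},
    {"height_cm": 172.0, "formants_hz": [520.0, 1560.0, 2600.0, 3640.0]},
    {"height_cm": 180.0, "formants_hz": [500.0, 1500.0, 2500.0, 3500.0]},
    {"height_cm": 188.0, "formants_hz": [480.0, 1440.0, 2400.0, 3360.0]},
]

heights = [s["height_cm"] for s in speakers]
dispersions = [formant_dispersion(s["formants_hz"]) for s in speakers]
r = pearson_r(heights, dispersions)
```

With these constructed values, `r` comes out negative, mirroring the direction of the relationship reported for male speakers; on real recordings the formants would first have to be estimated from the signal, and the correlation would be far weaker.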
For the actual experiments, the aGender corpus was provided for age determination
in four groups and gender determination in three groups (female, male, and
children). It consists of 46 h of telephone speech from 954 speakers. For height
determination in centimetres, the well-known TIMIT corpus is chosen: although
originally intended for automatic speech recognition experiments, it provides rich
speaker trait information and a sufficient number of speakers. This meta-information
includes the target trait height along with the additional speaker information
age, gender, dialect region, education level, and race. Note that the term
'race' stems from the available TIMIT corpus meta-information (cf. also Sect. 11.8).
As feature information, the set provided for the INTERSPEECH 2010 Paralinguistic
Challenge baseline calculation is used for all three traits. Note, however, that
height assessment was not part of the 2010 challenge and is only featured here as
an additional long-term trait example. Classification and regression of instances
in this systematically brute-forced feature space are done with SVM and SVR, a choice
motivated by the high popularity of these two variants in the broader field of speaker
state and trait assessment [83, 192, 193].
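A 'brute-forced' feature space of this kind is typically obtained by applying a fixed set of statistical functionals to every frame-level contour and its delta contour, yielding one fixed-length vector per utterance. The following Python sketch illustrates the idea only. The contour names, the five functionals, and the simple first-difference delta are assumptions chosen for brevity; they do not reproduce the actual challenge feature set.

```python
import statistics

# Illustrative functionals (assumption): the real set is considerably larger.
FUNCTIONALS = {
    "mean": statistics.fmean,
    "stddev": statistics.pstdev,
    "min": min,
    "max": max,
    "range": lambda values: max(values) - min(values),
}


def deltas(contour):
    """First-order differences, a simplified stand-in for delta coefficients."""
    return [b - a for a, b in zip(contour, contour[1:])]


def brute_force_features(contours):
    """Apply every functional to every contour and to its delta contour."""
    features = {}
    for name, contour in contours.items():
        for label, values in ((name, contour), (name + "_delta", deltas(contour))):
            for func_name, func in FUNCTIONALS.items():
                features[f"{label}_{func_name}"] = func(values)
    return features


# Hypothetical frame-level contours for one utterance (toy values).
utterance = {
    "f0_hz": [110.0, 115.0, 120.0, 118.0],
    "energy": [0.2, 0.5, 0.4, 0.3],
}

features = brute_force_features(utterance)
```

Here two contours, each with a delta, combined with five functionals give 20 features per utterance regardless of its length. A vector of this form is what an SVM (for the age and gender classes) or an SVR (for height in centimetres) would then take as input.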