At the same time, race meta-information consistently degrades height assessment
in these experiments.
10.4.3.5 Summary
The automatic assessment of a speaker's age, gender, and height was demonstrated. Assessing
age and gender in combination was observed to outperform their individual assessment.
As for the Emotion Challenge, the best participants' results were fused by majority
vote. This led to the so far unrivalled upper benchmark of 53.6 % UA for the age
classes and 85.7 % UA for the gender classes, again proving the superiority of fusing
multiple engines; a minimal sketch of such label-level voting is given below. When
classifying height, information on the other traits was added as features, and an
improvement was observed here as well.
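The fusion scheme just mentioned can be made concrete with a short sketch. The following Python snippet is a minimal illustration rather than the exact challenge procedure: it fuses the class labels of several hypothetical engines by majority vote, and both the engine outputs and the convention of breaking ties in favour of the first engine are assumptions made for illustration.

```python
from collections import Counter

def majority_vote(predictions):
    # predictions: list of per-engine label lists, all of equal length
    fused = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        label, top = counts.most_common(1)[0]
        # break ties in favour of the first engine (one possible convention)
        if sum(1 for c in counts.values() if c == top) > 1:
            label = votes[0]
        fused.append(label)
    return fused

# three hypothetical engines labelling four utterances with age groups
engine_a = ["child", "youth", "adult", "senior"]
engine_b = ["child", "adult", "adult", "adult"]
engine_c = ["youth", "adult", "adult", "senior"]
print(majority_vote([engine_a, engine_b, engine_c]))
# -> ['child', 'adult', 'adult', 'senior']
```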
There are obviously many other approaches to exploiting such knowledge, e.g.,
building age-, gender-, or height-dependent models for any of the other tasks. In
the case of age- and height-dependent models this will require further investigation,
as a reasonable quantisation of these continuous attributes is needed to this end; a
sketch of such quantisation and group-dependent modelling follows below. A further
step will be to find methods that estimate all of these traits at the same time by
mutual exploitation of each other's estimates. This can be particularly interesting
given the different forms of task representation (continuous, ordinal, or binary)
chosen here.
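As a minimal sketch of this quantisation idea, the snippet below bins a continuous age attribute into groups and trains one group-dependent model per bin; the synthetic data, the age boundaries, and the choice of a linear SVM are illustrative assumptions, not the book's prescribed setup.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# hypothetical data: acoustic features, continuous age in years, binary target trait
X = rng.normal(size=(300, 20))
age = rng.uniform(7, 80, size=300)
y = rng.integers(0, 2, size=300)

# quantise continuous age into groups (boundaries are an illustrative choice)
boundaries = [15, 25, 55]  # roughly: child / youth / adult / senior
group = np.digitize(age, boundaries)

# train one age-dependent model per group
models = {g: LinearSVC().fit(X[group == g], y[group == g])
          for g in np.unique(group)}

# at test time, route an utterance to the model of its (estimated) age group
g_test = int(np.digitize([age[0]], boundaries)[0])
print(models[g_test].predict(X[:1]))
```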
Provided that speech databases contain a transcription of the targeted speaker
information, the combination of different corpora might yield more accurate
results and a more versatile applicability of paralinguistic information extraction
systems. Thus, cross-corpus evaluations, as published for emotion recognition in
[161], could become part of future research on combined speaker trait analysis; one
possible shape of such a protocol is sketched below.
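The following sketch trains on one corpus and tests on the other in both directions, with UA computed as unweighted average (macro-averaged) recall. The synthetic corpora, the assumed binary trait task, and the linear SVM are placeholders for illustration; the actual setup of [161] is not reproduced here.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)

# two hypothetical corpora sharing the same feature space and label set
def make_corpus(n, shift):
    X = rng.normal(shift, 1.0, size=(n, 20))
    y = rng.integers(0, 2, size=n)
    X[y == 1] += 0.5  # crude class separation
    return X, y

corpora = {"A": make_corpus(200, 0.0), "B": make_corpus(200, 0.3)}

# cross-corpus evaluation: train on one corpus, test on the other
for train_name, test_name in [("A", "B"), ("B", "A")]:
    X_tr, y_tr = corpora[train_name]
    X_te, y_te = corpora[test_name]
    clf = LinearSVC().fit(X_tr, y_tr)
    ua = recall_score(y_te, clf.predict(X_te), average="macro")  # UA recall
    print(f"train {train_name} -> test {test_name}: UA = {ua:.3f}")
```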
Finally, the automatic assessment of certain speaker characteristics such as age
potentially also profits from the inclusion of linguistic features in addition to acoustic
descriptors. This in turn would require an automatic speech recognition module
extracting linguistic information for combined acoustic-linguistic analysis. In the
field of emotion recognition [83], recent studies have shown that even though the
word accuracies of automatic speech recognisers processing spontaneous, emotional
speech are lower than those of dictation systems recognising well-articulated, read
speech, the inclusion of speech recognisers for linguistic feature generation reliably
boosts emotion recognition accuracies. It is of interest whether a similar behaviour
can be observed in the case of traits; a simple combination of both feature types is
sketched below.
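One straightforward way to study this for traits would be early (feature-level) fusion of both information streams. The sketch below, with invented ASR hypotheses and arbitrary acoustic feature dimensions, concatenates an acoustic feature vector with a bag-of-words representation of a possibly erroneous transcript; the concrete fusion scheme is an assumption, not the method reported above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical ASR hypotheses (possibly erroneous) for four utterances
transcripts = ["i am so happy today", "leave me alone now",
               "this is is wonderful", "i feel terrible"]

# placeholder acoustic features, e.g., prosodic statistics per utterance
acoustic = np.random.default_rng(1).normal(size=(4, 10))

# linguistic features: a simple bag-of-words over the ASR output
vectoriser = CountVectorizer()
linguistic = vectoriser.fit_transform(transcripts).toarray()

# early (feature-level) fusion: concatenate both views per utterance
fused = np.hstack([acoustic, linguistic])
print(acoustic.shape, linguistic.shape, fused.shape)
```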
10.4.4 Mid-term: Intoxication and Sleepiness
Apart from the short-term speaker state of emotion, there exist mid-term states which
are not permanent, yet do not change instantly. These comprise, for example:
• (partly) self-induced: sleepiness [197], intoxication (e.g., alcoholisation [77, 198,
199]), health state [104], mood (e.g., depression [200]);
 