At the same time, race meta-information consistently degrades height assessment
in these experiments.
10.4.3.5 Summary
The automatic assessment of a speaker's age, gender, and height was demonstrated. Assessing
age and gender in combination was observed to outperform their individual assessment.
As for the Emotion Challenge, the best participants' results were fused by majority
vote. This led to the so far unrivalled upper benchmark of 53.6 % UA for the age
classes and 85.7 % UA for the gender classes, again proving the superiority of fusing
multiple engines; a minimal sketch of such label-level voting is given below. When
classifying height, information on the other traits was added as features, and an
improvement was observed here as well.
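The fusion scheme just mentioned can be made concrete with a short sketch. The following Python snippet is a minimal illustration rather than the exact challenge procedure: it fuses the class labels of several hypothetical engines by majority vote, and both the engine outputs and the convention of breaking ties in favour of the first engine are assumptions made for illustration.

```python
from collections import Counter

def majority_vote(predictions):
    # predictions: list of per-engine label lists, all of equal length
    fused = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        label, top = counts.most_common(1)[0]
        # break ties in favour of the first engine (one possible convention)
        if sum(1 for c in counts.values() if c == top) > 1:
            label = votes[0]
        fused.append(label)
    return fused

# three hypothetical engines labelling four utterances with age groups
engine_a = ["child", "youth", "adult", "senior"]
engine_b = ["child", "adult", "adult", "adult"]
engine_c = ["youth", "adult", "adult", "senior"]
print(majority_vote([engine_a, engine_b, engine_c]))
# -> ['child', 'adult', 'adult', 'senior']
```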
There are obviously many other approaches to exploiting such knowledge, e.g.,
building age-, gender-, or height-dependent models for any of the other tasks. In
the case of age- and height-dependent models this will require further investigation,
as a reasonable quantisation of these continuous attributes is needed to this end; a
sketch of such quantisation and group-dependent modelling follows below. A further
step will be to find methods that estimate all of these traits at the same time by
mutual exploitation of each other's estimates. This can be particularly interesting
given the different forms of task representation (continuous, ordinal, or binary)
chosen here.
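As a minimal sketch of this quantisation idea, the snippet below bins a continuous age attribute into groups and trains one group-dependent model per bin; the synthetic data, the age boundaries, and the choice of a linear SVM are illustrative assumptions, not the book's prescribed setup.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# hypothetical data: acoustic features, continuous age in years, binary target trait
X = rng.normal(size=(300, 20))
age = rng.uniform(7, 80, size=300)
y = rng.integers(0, 2, size=300)

# quantise continuous age into groups (boundaries are an illustrative choice)
boundaries = [15, 25, 55]  # roughly: child / youth / adult / senior
group = np.digitize(age, boundaries)

# train one age-dependent model per group
models = {g: LinearSVC().fit(X[group == g], y[group == g])
          for g in np.unique(group)}

# at test time, route an utterance to the model of its (estimated) age group
g_test = int(np.digitize([age[0]], boundaries)[0])
print(models[g_test].predict(X[:1]))
```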
Provided that speech databases contain a transcription of the targeted speaker
information, the combination of different corpora might yield more accurate
results and a more versatile applicability of paralinguistic information extraction
systems. Thus, cross-corpus evaluations, as published for emotion recognition in
[161], could become part of future research on combined speaker trait analysis; one
possible shape of such a protocol is sketched below.
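The following sketch trains on one corpus and tests on the other in both directions, with UA computed as unweighted average (macro-averaged) recall. The synthetic corpora, the assumed binary trait task, and the linear SVM are placeholders for illustration; the actual setup of [161] is not reproduced here.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)

# two hypothetical corpora sharing the same feature space and label set
def make_corpus(n, shift):
    X = rng.normal(shift, 1.0, size=(n, 20))
    y = rng.integers(0, 2, size=n)
    X[y == 1] += 0.5  # crude class separation
    return X, y

corpora = {"A": make_corpus(200, 0.0), "B": make_corpus(200, 0.3)}

# cross-corpus evaluation: train on one corpus, test on the other
for train_name, test_name in [("A", "B"), ("B", "A")]:
    X_tr, y_tr = corpora[train_name]
    X_te, y_te = corpora[test_name]
    clf = LinearSVC().fit(X_tr, y_tr)
    ua = recall_score(y_te, clf.predict(X_te), average="macro")  # UA recall
    print(f"train {train_name} -> test {test_name}: UA = {ua:.3f}")
```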
Finally, the automatic assessment of certain speaker characteristics such as age
potentially also profits from the inclusion of linguistic features in addition to acoustic
descriptors. This in turn would require an automatic speech recognition module
extracting linguistic information for combined acoustic-linguistic analysis. In the
field of emotion recognition [83], recent studies have shown that even though the
word accuracies of automatic speech recognisers processing spontaneous, emotional
speech are lower than those of dictation systems recognising well-articulated, read
speech, the inclusion of speech recognisers for linguistic feature generation reliably
boosts emotion recognition accuracies. It is of interest whether a similar behaviour
can be observed in the case of traits; a simple combination of both feature types is
sketched below.
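One straightforward way to study this for traits would be early (feature-level) fusion of both information streams. The sketch below, with invented ASR hypotheses and arbitrary acoustic feature dimensions, concatenates an acoustic feature vector with a bag-of-words representation of a possibly erroneous transcript; the concrete fusion scheme is an assumption, not the method reported above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical ASR hypotheses (possibly erroneous) for four utterances
transcripts = ["i am so happy today", "leave me alone now",
               "this is is wonderful", "i feel terrible"]

# placeholder acoustic features, e.g., prosodic statistics per utterance
acoustic = np.random.default_rng(1).normal(size=(4, 10))

# linguistic features: a simple bag-of-words over the ASR output
vectoriser = CountVectorizer()
linguistic = vectoriser.fit_transform(transcripts).toarray()

# early (feature-level) fusion: concatenate both views per utterance
fused = np.hstack([acoustic, linguistic])
print(acoustic.shape, linguistic.shape, fused.shape)
```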
10.4.4 Mid-term: Intoxication and Sleepiness
Apart from the short-term speaker state of emotion, there exist mid-term states which
are not permanent, yet do not change instantly. These comprise, for example:
• (partly) self-induced: sleepiness [197], intoxication (e.g., alcoholisation [77, 198,
199]), health state [104], mood (e.g., depression [200]);
 