10.4.2.1 FAU Aibo Emotion Corpus
One of the major needs of the emotion recognition community, perhaps even more than in many related pattern recognition tasks, has always been suitable data sets. In the early days of the late 1990s, these were not only few, but also small (≤ 500 turns), with few subjects (≤ 10), uni-modal, recorded under studio noise conditions, and acted [161-163]. Further, the spoken content was mostly predefined (e.g., the Danish Emotional Speech (DES), EMO-DB, and Speech Under Simulated and Actual Stress (SUSAS) databases) [164]. These were seldom made public, and the few annotators, if any at all, usually labelled exclusively the perceived emotion. Additionally, some of these were not intended for analysis, but for quality measurement of synthesis (e.g., DES, EMO-DB).
Today, more diverse emotions are covered, there are more elicited or even spontaneous sets with many speakers, and there are larger numbers of instances (up to 10 k and more) from more subjects (up to 50), annotated by more labellers (from 4 (AVIC) to 17 (VAM [165])) and partly made publicly available. For acted data, an equal distribution among classes is of course easily obtainable. Transcriptions are also becoming richer: additional annotation of spoken content and non-linguistic interjections (e.g., the AVIC, Belfast Naturalistic, FAU AIBO, and SmartKom databases [164]), multiple annotator tracks (e.g., VAM), manually corrected pitch contours (FAU AIBO), additional audio tracks under different noise and reverberation conditions (FAU AIBO), phoneme boundaries and manual phoneme labelling (e.g., EMO-DB), different units of analysis, and different levels of prototypicality (e.g., FAU AIBO). At the same time, these are partly also recorded under more realistic conditions (or taken from the media).
To meet as many of these requirements as possible, the FAU AIBO database [166] was chosen for the first Challenge: it is a corpus of recordings of children interacting with Sony's pet robot Aibo. The corpus consists of spontaneous, emotionally coloured German speech. The speech is spontaneous because the children were not given specific instructions but were simply told to talk to the Aibo as they would talk to a friend. The children were led to believe that the Aibo was responding to their commands, whereas the robot was actually controlled by a human operator. This wizard caused the Aibo to perform a fixed, predetermined sequence of actions; sometimes the Aibo behaved disobediently, thereby provoking emotional reactions.
The data was collected at two different schools, Mont and Ohm, from 51 children (age 10-13; 21 male, 30 female; about 9.2 h of speech without pauses). Speech was transmitted with a high-quality wireless headset (UT 14/20 TP SHURE UHF series with a WH20TQG microphone) and recorded with a DAT recorder (48 kHz sampling rate, 16 bit quantisation, later down-sampled to 16 kHz). The recordings were segmented automatically into 'turns' using a pause threshold of 1 s.
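As a rough illustration of this segmentation step, the following is a minimal sketch of pause-based turn segmentation on a mono signal already down-sampled to 16 kHz, using a simple frame-energy silence criterion; the frame size and energy threshold are illustrative assumptions, not the corpus's documented parameters.

    import numpy as np

    def segment_turns(samples, sr=16000, pause_s=1.0,
                      frame_s=0.025, energy_thresh=1e-4):
        # Split a mono float signal into 'turns' separated by pauses of
        # at least pause_s seconds. A frame counts as silent when its
        # mean energy falls below energy_thresh (illustrative value).
        frame_len = int(frame_s * sr)
        n_frames = len(samples) // frame_len
        frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
        voiced = (frames.astype(np.float64) ** 2).mean(axis=1) >= energy_thresh

        turns, start, silence = [], None, 0
        max_silence = int(round(pause_s / frame_s))  # frames making up one pause
        for i, v in enumerate(voiced):
            if v:
                if start is None:
                    start = i * frame_len            # turn begins at first voiced frame
                silence = 0
            elif start is not None:
                silence += 1
                if silence >= max_silence:           # pause of >= 1 s closes the turn
                    turns.append((start, (i - silence + 1) * frame_len))
                    start, silence = None, 0
        if start is not None:                        # flush a trailing open turn
            turns.append((start, n_frames * frame_len))
        return turns                                 # (start, end) sample indices

A production voice activity detector would of course be more involved (adaptive thresholds, hangover schemes), but the pause-threshold logic is the same.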
Five labellers (advanced students of linguistics) listened to the turns in sequential order and, independently of each other, annotated each word as neutral (the default) or as belonging to one of ten other classes of emotion. Since many utterances are only short commands and rather long pauses can occur between words due to Aibo's reaction time, the emotional or emotion-related state of the child can also change within turns.
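Since each word thus receives five independent annotations, a single reference label per word has to be derived for training and evaluation. Below is a minimal sketch assuming a simple majority vote with neutral as the fallback; this decision rule is an illustrative assumption, not necessarily the corpus's documented mapping.

    from collections import Counter

    def word_label(annotations, default="neutral"):
        # Combine the independent per-word labels of several annotators
        # into one reference label. Assumed rule: a label wins when a
        # strict majority of labellers chose it; otherwise fall back to
        # the neutral default.
        label, votes = Counter(annotations).most_common(1)[0]
        return label if votes > len(annotations) // 2 else default

    # Example: three of five labellers perceive the same emotion.
    print(word_label(["angry", "angry", "neutral", "angry", "touchy"]))   # angry
    print(word_label(["angry", "touchy", "neutral", "bored", "joyful"]))  # neutral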
 