A series of open research challenges held from 2009 to 2012 changed this recently: the INTERSPEECH 2009 Emotion Challenge [63-65], the INTERSPEECH 2010 Paralinguistic Challenge [47, 66], the INTERSPEECH 2011 Speaker State Challenge [55], and the INTERSPEECH 2012 Speaker Trait Challenge [67]. Further, the first and second International Audio/Visual Emotion Challenge and Workshop (AVEC) in 2011 [68, 69] and 2012 [70], held as satellites of the HUMAINE International Conference on Affective Computing and Intelligent Interaction (ACII) and the ACM International Conference on Multimodal Interaction, respectively, contained speech analysis tasks. The aim of this succession of challenges has been two-fold. First, to establish in the broad and divergent field of paralinguistics the strict partitioning of data into training, development, and test sets, together with well-defined performance measures, as is standard in established fields such as ASR. Second, to address two respects in which research in this field has mostly been lacking: small, preselected, prototypical, and often non-natural data sets, and the aforementioned low comparability of results. All of these events saw very high participation from the research community; 52 research teams took part in the most recent one. Using the methods presented in this book, the best results on these challenge tasks could be obtained [30, 54, 56, 67, 71]. These and several further benchmark results were presented in this book with constant emphasis on reproducibility and accessibility of data and algorithms for the research community.
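To make this evaluation protocol concrete, the following minimal Python sketch, with purely hypothetical data and a placeholder classifier, illustrates a strict training/development/test partitioning together with unweighted average recall (UAR), the official measure of these challenges. Note that real challenge partitions are fixed in advance and speaker-disjoint rather than drawn at random as here.

```python
# Minimal sketch of a strict train/development/test protocol with
# UAR as the performance measure. Data and model are hypothetical
# placeholders; challenge partitions are fixed and speaker-disjoint.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 384))       # e.g. 384-dim. feature vectors
y = rng.integers(0, 2, size=600)      # binary labels

# Fixed, disjoint partitions: tune on dev only, touch test once.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=1)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=1)

best_C, best_uar = None, -1.0
for C in (0.01, 0.1, 1.0):            # tune complexity on dev only
    clf = LinearSVC(C=C).fit(X_train, y_train)
    uar = recall_score(y_dev, clf.predict(X_dev), average="macro")
    if uar > best_uar:
        best_C, best_uar = C, uar

# One final evaluation on the untouched test set; macro-averaged
# recall over classes is exactly the UAR.
clf = LinearSVC(C=best_C).fit(np.vstack([X_train, X_dev]),
                              np.concatenate([y_train, y_dev]))
print("test UAR:", recall_score(y_test, clf.predict(X_test),
                                average="macro"))
```

Tuning only on the development set and evaluating exactly once on the test set is what keeps reported numbers comparable across participants.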
Standards were further provided by the openSMILE [72] and openEAR [73] toolkits as presented in this book (cf. Sect. 6.5). As open-source software, they are entirely transparent, and their standardised feature sets (cf. the Annex for four of these) provide a good starting point for many audio analysis tasks. For source separation, the openBliSSART toolkit [4, 74] plays a similar role, as was shown in Chap. 8.
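As a rough illustration of the technique underlying openBliSSART, rather than its actual API, the following Python sketch factorises a (here randomly generated) magnitude spectrogram by non-negative matrix factorisation (NMF) with multiplicative updates; the component count and iteration number are arbitrary choices.

```python
# NMF sketch: factorise a magnitude spectrogram V into spectral
# bases W and activations H, V ~ W @ H, with multiplicative
# updates minimising the Euclidean distance.
import numpy as np

def nmf(V, n_components, n_iter=200, eps=1e-9):
    rng = np.random.default_rng(0)
    n_bins, n_frames = V.shape
    W = rng.random((n_bins, n_components)) + eps
    H = rng.random((n_components, n_frames)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical magnitude spectrogram (e.g. from an STFT):
# 513 frequency bins x 100 frames.
V = np.abs(np.random.default_rng(1).normal(size=(513, 100)))
W, H = nmf(V, n_components=10)
# A source estimate keeps a subset of components, e.g. the first 5:
V_source = W[:, :5] @ H[:5, :]
print("reconstruction error:", np.linalg.norm(V - W @ H))
```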
Another part of standardisation is found in the datasets introduced, which are mostly freely accessible to interested readers and have by now found manifold usage,1 including the following nine that cover a broad range of Intelligent Audio Analysis tasks: HU-ASA [75], TUM AVIC [53], Metacritic [76], BRD [61], NTWICM [77], Audio Key [17], FindSounds [45], Emotional FindSounds [62], and UltraStar [2].
(V) Deficiencies in current approaches and future perspectives in and for the field were shown in detail for all presented exemplary tasks in the respective sections and chapters. However, this book shall be concluded by a more general perspective on Intelligent Audio Analysis best practice, remaining challenges, and a vision of the future of this field.
Finally, it should be noted that fusion with other modalities, in particular image and video processing, can lead to improvements for many of the tasks discussed, such as non-linguistic vocalisation recognition [26, 78] or emotion recognition [79-83].
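A minimal sketch of one common variant, late (decision-level) fusion, with purely hypothetical posteriors: the class scores of an audio and a video model are combined by a weighted sum whose weight would be tuned on development data.

```python
# Late-fusion sketch: weighted sum of per-class posteriors from
# two modality-specific models (values here are hypothetical).
import numpy as np

p_audio = np.array([0.6, 0.3, 0.1])   # audio model posteriors
p_video = np.array([0.2, 0.7, 0.1])   # video model posteriors

w = 0.5                               # fusion weight, tuned on dev
p_fused = w * p_audio + (1 - w) * p_video
print("fused decision:", int(np.argmax(p_fused)))
```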
Further, successful transfer of the introduced methods, such as feature brute-forcing and LSTM modelling, can be of interest, as was shown for 3D gesture recognition in [23] and for CAN-bus data analysis in the car in [28].
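The following Python sketch gives a rough flavour of feature brute-forcing under simplifying assumptions: two illustrative low-level descriptors and a small set of functionals, not the book's actual descriptor and functional inventories, are crossed systematically to yield a fixed-length per-clip feature vector.

```python
# Feature brute-forcing sketch: apply every functional to every
# frame-wise low-level descriptor (LLD) and its delta, producing
# one fixed-length feature vector per audio clip.
import numpy as np

def llds(signal, frame_len=400, hop=160):
    """Two simple LLDs per frame: log energy and zero-crossing rate."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len, hop)]
    energy = np.array([np.log(np.sum(f ** 2) + 1e-10) for f in frames])
    zcr = np.array([np.mean(np.abs(np.diff(np.sign(f))) > 0)
                    for f in frames])
    return {"logE": energy, "zcr": zcr}

FUNCTIONALS = {
    "mean": np.mean, "std": np.std, "min": np.min, "max": np.max,
    "range": np.ptp, "median": np.median,
}

def brute_force(signal):
    """Cross every LLD (and its delta) with every functional."""
    feats = {}
    for name, contour in llds(signal).items():
        for variant, c in ((name, contour),
                           (name + "_delta", np.diff(contour))):
            for fname, func in FUNCTIONALS.items():
                feats[f"{variant}_{fname}"] = func(c)
    return feats

x = np.random.default_rng(2).normal(size=16000)  # 1 s of fake 16 kHz audio
print(len(brute_force(x)), "features")  # 2 LLDs x 2 variants x 6 functionals = 24
```

With realistically large LLD and functional inventories, the same systematic crossing yields the feature spaces of several thousand dimensions used throughout the book.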
1 http://www.openaudio.eu