followed if existent—at least in addition to individual solutions. Multi-corpus and cross-corpus evaluation, such as for age and gender and for emotion [86, 87], is crucial to assess the generalisation of AMs and LMs. In fact, experiments in a cross-corpus manner indicate overfitting to single corpora [87]. This trend can only partly be eased by corpus adaptation and normalisation. In addition, optimisation such as feature selection or parameter optimisation for the learning algorithm may exhibit low cross-data generalisation [88]. Still, unification of the labelling schemes as mentioned above
introduces 'information loss'. A late fusion of multiple classifiers trained on single corpora with different labelling schemes may help overcome this in the future. In addition, the efficacy of semi-supervised learning to leverage unlabelled audio data for its computationally intelligent analysis has been repeatedly demonstrated [46, 89-91].
This may be turned into large-scale studies across multiple tasks using large amounts of data acquired from the web. Finally, a promising technique is the synthesis of training data: in fact, it has been shown that the generalisation properties of models in a cross-corpus setting can be improved through joint training with both human and synthetic speech [44] or human-played and MIDI-synthesised music [92]. These results are very promising, since synthetic audio can easily be produced in large quantities, and a variety of combinations can be simulated. It is hoped that this will yield good generalisation of models and facilitate learning of multiple tasks and their interdependencies. In any case, the acquisition of more and well-defined data for building robust and generalising models can thus be seen as a major challenge for the future.
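To make the cross-corpus protocol concrete, the following minimal sketch trains on one corpus and tests on the other in both directions, reporting unweighted average recall; the data, feature dimension, and binary label scheme are random stand-ins (assumptions for illustration), not the cited age, gender, or emotion corpora.

```python
# A minimal cross-corpus evaluation sketch: train on corpus A, test on
# corpus B, and vice versa. All data here is a random stand-in.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
corpora = {name: (rng.normal(size=(200, 64)), rng.integers(0, 2, 200))
           for name in ("A", "B")}

for train, test in (("A", "B"), ("B", "A")):
    X_tr, y_tr = corpora[train]
    X_te, y_te = corpora[test]
    # Standardisation is fit on the training corpus only; the corpus
    # normalisation mentioned above would fit one scaler per corpus.
    clf = make_pipeline(StandardScaler(), LinearSVC()).fit(X_tr, y_tr)
    uar = recall_score(y_te, clf.predict(X_te), average="macro")
    print(f"train {train} -> test {test}: UAR = {uar:.3f}")
```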
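Likewise, the semi-supervised idea can be illustrated by simple self-training: a classifier trained on few labelled instances labels the unlabelled pool itself and retrains on the confidently labelled portion. The confidence threshold of 0.9, the number of rounds, and the stand-in data below are assumptions for illustration only.

```python
# A minimal self-training sketch of semi-supervised learning.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_lab, y_lab = rng.normal(size=(50, 64)), rng.integers(0, 2, 50)
X_unlab = rng.normal(size=(1000, 64))   # unlabelled pool, e.g., web audio

clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
for _ in range(3):                                  # a few rounds
    proba = clf.predict_proba(X_unlab)
    keep = proba.max(axis=1) > 0.9                  # confidence threshold
    if not keep.any():
        break
    # Augment the labelled set with self-labelled, confident instances.
    X_aug = np.vstack([X_lab, X_unlab[keep]])
    y_aug = np.concatenate([y_lab, proba[keep].argmax(axis=1)])
    clf = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
```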
Source separation: The results cited in this book clearly demonstrate the gain obtained by source separation for the enhancement of the signal of interest in real-life audio streams. NMF and its derivatives were shown to be particularly well suited algorithms, e.g., in the openBliSSART implementation. These may be complemented by methods exploiting multiple sources, such as ICA for stereophonic information or the exploitation of microphone array feeds.
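The following is a minimal sketch of NMF-based enhancement in this spirit, not the openBliSSART implementation itself: the magnitude spectrogram is factorised by multiplicative updates, and an assumed subset of components is resynthesised through a Wiener-style soft mask. In practice, components would be assigned to sources via pre-trained dictionaries rather than the placeholder split used here.

```python
# Sketch of NMF-based enhancement: factorise the magnitude spectrogram
# V ~ W @ H with Lee-Seung multiplicative updates, then resynthesise an
# assumed target subset of components via a Wiener-style soft mask.
import numpy as np
from scipy.signal import stft, istft

fs = 16000
x = np.random.randn(2 * fs)                 # stand-in for a real audio stream
f, t, Z = stft(x, fs=fs, nperseg=512)
V = np.abs(Z) + 1e-10                       # non-negative magnitudes

k = 20                                      # number of NMF components
rng = np.random.default_rng(2)
W = rng.random((V.shape[0], k))
H = rng.random((k, V.shape[1]))
for _ in range(100):                        # multiplicative updates
    H *= (W.T @ V) / (W.T @ W @ H + 1e-10)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-10)

target = slice(0, k // 2)                   # placeholder component assignment
mask = (W[:, target] @ H[target]) / (W @ H + 1e-10)
_, x_hat = istft(mask * Z, fs=fs, nperseg=512)  # enhanced signal estimate
```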
Feature brute-forcing: The features used in early Intelligent Audio Analysis research were often motivated by the fields of ASR and speaker recognition, as these were among the earliest and the driving forces. As a consequence, the usage of spectral or cepstral features such as MFCCs prevails to the present day [84]. In the meantime, manifold expert-crafted acoustic features, including perceptually motivated ones [55, 93, 94] or those based on pre-classification [95], were introduced. These have often been successfully evaluated for diverse audio analysis tasks, as was also shown in this book, along with the addition of more or less brute-forced features. Furthermore, it has repeatedly been shown that enlarging the feature space can help boost accuracy [11, 55]. Such large spaces can be brute-forced by toolkits such as openSMILE and serve as a broad basis for subsequent space optimisation—in particular when approaching novel audio analysis tasks. A promising additional direction is the semi-supervised learning of features, e.g., through deep belief networks [86] or bottleneck topologies [27, 40].
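The brute-forcing principle itself can be sketched simply: a fixed set of statistical functionals is applied to every low-level descriptor (LLD) contour, yielding |LLDs| x |functionals| static features per instance. The following is a minimal illustration of that idea, not openSMILE's actual configuration or API; the 39-dimensional LLD matrix is a random stand-in for, e.g., MFCC-style descriptors.

```python
# Brute-forcing a feature space: apply every functional to every LLD.
import numpy as np
from scipy.stats import skew, kurtosis

functionals = {
    "mean": np.mean, "std": np.std, "min": np.min, "max": np.max,
    "range": np.ptp, "skewness": skew, "kurtosis": kurtosis,
    "slope": lambda c: np.polyfit(np.arange(len(c)), c, 1)[0],
}

def brute_force(llds):
    """Map an (n_frames, n_llds) contour matrix to one static feature vector."""
    return {f"lld{i}_{name}": func(llds[:, i])
            for i in range(llds.shape[1])
            for name, func in functionals.items()}

llds = np.random.randn(300, 39)             # 39 LLDs over 300 frames
features = brute_force(llds)                # 39 x 8 = 312 features
```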
Temporal evolution modelling: It has been shown in several chapters of this book, touching all three fields of speech, music, and sound, that explicit storage of temporal context—in particular with learning the optimal amount of such context—