idealised audio material. After this step, systematic feature brute-forcing of up to thousands of audio features was shown to be a highly efficient means, in particular also for handling novel Intelligent Audio Analysis tasks. This was seen in many of the presented tasks [10-12]. At the same time, individually tailored feature types were shown, in particular such based on NMF activations [4-6, 13-16] or on music theory and human perception [17], the idea being to demonstrate the limitations of unification.
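For orientation, the following is a minimal sketch of obtaining NMF activations as frame-wise features, assuming a frames-by-bins magnitude spectrogram, placeholder data, and scikit-learn's NMF; it is not the exact configuration of the cited works.

import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Placeholder magnitude spectrograms: frames x frequency bins
V_train = np.abs(rng.standard_normal((500, 257)))
V_test = np.abs(rng.standard_normal((100, 257)))

nmf = NMF(n_components=20, init='nndsvda', max_iter=500)
nmf.fit(V_train)                      # learns 20 non-negative spectral bases (nmf.components_)
activations = nmf.transform(V_test)   # (100, 20): one activation vector per frame
print(activations.shape)

With the spectral bases held fixed after training, the per-frame activation vectors then serve as the feature representation.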
Further, memory-enhanced learning algorithms such as (B)LSTM RNNs [3, 16, 18-30] and their synergistic combination with DBNs [5, 31-42] were shown to prevail in many tasks. For example, two MIREX 2010 Challenges, for music onset detection [22] and tempo determination [21], were won by this method.
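For illustration, a minimal sketch of such a memory-enhanced model follows, here a bidirectional LSTM for frame-wise sequence labelling written in PyTorch; the layer sizes and the two-class output are placeholder assumptions, not the networks of the cited submissions.

import torch
import torch.nn as nn

class BLSTMTagger(nn.Module):
    def __init__(self, n_features=40, n_hidden=64, n_classes=2):
        super().__init__()
        self.blstm = nn.LSTM(n_features, n_hidden, batch_first=True,
                             bidirectional=True)
        self.out = nn.Linear(2 * n_hidden, n_classes)  # forward + backward states

    def forward(self, x):           # x: (batch, time, features)
        h, _ = self.blstm(x)        # h: (batch, time, 2 * n_hidden)
        return self.out(h)          # frame-wise scores, e.g. onset vs. no onset

model = BLSTMTagger()
scores = model(torch.randn(8, 100, 40))  # 8 sequences of 100 feature frames
print(scores.shape)                      # torch.Size([8, 100, 2])

The bidirectional layer gives each frame access to both past and future context, which is what allows frame-wise decisions such as onset detection to exploit long-range temporal structure.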
Then, suited GM topologies such as SLDM and SLDS were presented in their successful application to highly noise-robust ASR [43]. In their combination, the overall efforts led to the best result in the CHiME 2011 Challenge for highly robust keyword spotting when using only a single microphone source [5]. Subsequent to the Challenge, the overall best result, beating also those approaches that exploit multiple microphone sources, could be reached by combining the presented approaches to source separation with NMF activation features and a triple-stream topology of a DBN with BLSTM RNN feed [6].
To ease the ever-present bottleneck of data sparseness, a series of methods was further suggested and shown to be beneficial. These include the synthesis of training material [44] and the collaboration of machine and human in the labelling of data, guided by the machine: the machine first labels by itself the data for which it is sufficiently confident that it can assign the correct label, in a semi-supervised learning step [45, 46]. Then, it asks for the human's help if it cannot assign a label with sufficient confidence but considers the data potentially interesting, for example because it covers a sparse class; this is the 'active learning' step. Finally, it decides that some instances might not be of interest, as they are too similar to already seen data. Further, transfer learning methods can help to use data from 'similar' conditions.
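The following sketch illustrates this triage of unlabelled data; the confidence and similarity thresholds, as well as the cosine-similarity notion of 'too similar', are assumptions for illustration rather than the criteria of the cited works.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def triage(model, X_seen, X_new, conf_self=0.9, sim_discard=0.99):
    # Confidence of the machine in its own predictions
    confidence = model.predict_proba(X_new).max(axis=1)  # any classifier with predict_proba
    # Closeness of each new instance to the already seen data
    similarity = cosine_similarity(X_new, X_seen).max(axis=1)

    self_label = confidence >= conf_self                  # semi-supervised step: trust own label
    discard = ~self_label & (similarity >= sim_discard)   # too similar to already seen data
    ask_human = ~self_label & ~discard                    # active learning step: query the human
    return self_label, ask_human, discard

In practice, the human is asked only for the instances in the ask_human partition, so annotation effort is concentrated where the machine is uncertain and the data is likely informative.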
(III) The very broad range of Intelligent Audio Analysis applications was shown. These include the recognition of speaker states and traits such as age [47], height [48], interest [27, 49-54], intoxication [55, 56], and sleepiness [12, 57], of singer traits in polyphonic music [2, 3, 58] such as age, gender, height, and race, the recognition of ballroom dance style in music [59-61], and of the emotion evoked in the listener of sounds [62], to name the most recent of the examples.
(IV) Benchmark results and standardised test-beds were shown for a broader range of audio analysis tasks. Especially in the field of paralinguistic speech analysis, these were entirely lacking until very recently; instead, comparability between research results in the field was considerably low. Apart from different evaluation strategies, the diversity of corpora is high, as many early studies report results on their individual and proprietary corpora. Additionally, practically no feature set was found twice: high diversity is found not only in the selection of LLDs, but also in the perceptual adaptation, speaker adaptation, and, most of all, in the selection and implementation of functionals. This stands in contrast to the more or less settled and clearly defined feature types MFCC, RASTA, or PLP, which allow for higher comparability in speech recognition.
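To make the LLD-plus-functionals paradigm concrete, the following is a minimal sketch of applying functionals to frame-wise LLD contours; the particular LLDs and functionals chosen here are illustrative assumptions, not a standardised set.

import numpy as np

def apply_functionals(lld):    # lld: (n_frames, n_lld) contour matrix
    funcs = [np.mean, np.std, np.min, np.max,
             lambda c, axis: np.percentile(c, 95, axis=axis)]
    # Each functional maps a variable-length contour to one value per LLD
    return np.concatenate([f(lld, axis=0) for f in funcs])

lld_contour = np.random.randn(300, 4)      # e.g. 300 frames of pitch, energy, ZCR, HNR
features = apply_functionals(lld_contour)  # fixed-length vector: 4 LLDs x 5 functionals
print(features.shape)                      # (20,)

The point of the functionals is to map utterances of arbitrary duration onto a fixed-length vector; the diversity noted above arises because every study picks its own set of such statistics.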
A series of consecutive annual research challenges held at INTERSPEECH