outperforms common approaches in the field that do not provide this option.
The LSTM architectures shown are clearly suited in this respect, and whenever the application allows for modelling of temporal context dependencies, these or future alternatives should be considered, either on their own or in combination with other machine learning algorithms to provide them with dynamic warping abilities, as in the tandem DBN-BLSTM architectures shown.
Coupling of tasks [84]: A number of interdependencies are already visible in the tasks that were considered in this topic. With the addition of further or novel tasks, such dependencies are likely to be amplified. For example, in speaker analysis, long-term traits are coupled to some degree, e.g., height with age, gender, and race, as was shown by their interdependence in the determination of age, gender, and height in this topic.
Other examples include emotional manifestation being dependent on personality [96], and gender dependencies of non-linguistic vocalisations such as laughter [97].
In this topic, an example was also shown in the music domain for the interdependence of ballroom dance style, metre, and tempo. It seems obvious that the additionally introduced sound analysis tasks of event and evoked emotion are also interdependent.
Such knowledge can be integrated by keeping separate models conditioned on the other tasks, by adaptation or normalisation, or by considering additional information on related 'side tasks' [98, 99], as in the examples named above and shown in this topic. An alternative to such explicit modelling of dependencies is to learn them automatically from training data. For example, the rather simple strategy of using pairs of age and gender classes as the learning target, instead of each attribute individually, was shown to be beneficial in this topic and in [47]. In the future, enhanced modelling of multiple
correlated target variables should be a common aim in multi-task learning. The input features are then shared among tasks, for example via the internal activations in the hidden layer of a neural network [100]. A challenge may then arise from the different representation of task variables by various data types (continuous, ordinal, nominal), which may additionally operate on different time scales (e.g., dance style is mostly constant within a musical piece, whereas tempo may vary). By considering suitable methods for multi-scale fusion and multi-task learning [101], future Intelligent Audio Analysis should not focus on tasks in isolation, but aim at a 'more holistic' analysis of tasks.
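The two strategies discussed above, joint target classes and a shared representation with per-task output heads, can be sketched as follows. This is a minimal illustration only: the feature vectors, label values, and layer sizes are hypothetical, the network weights are random, and no training step is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical acoustic feature vectors: 4 examples, 6 features each
X = rng.normal(size=(4, 6))

# Strategy 1: pair two correlated attributes into one joint target class,
# so a single classifier can capture their dependency implicitly
ages = ["adult", "child", "adult", "senior"]
genders = ["f", "m", "m", "f"]
joint_targets = [f"{a}|{g}" for a, g in zip(ages, genders)]
print(joint_targets[0])  # "adult|f"

# Strategy 2: a multi-task network sharing one hidden layer, with a
# separate output head per task (weights are random, i.e. untrained)
W_shared = rng.normal(size=(6, 8))   # shared input-to-hidden weights
W_age = rng.normal(size=(8, 3))      # head for 3 age classes
W_gender = rng.normal(size=(8, 2))   # head for 2 gender classes

hidden = np.tanh(X @ W_shared)       # activations shared among both tasks
age_logits = hidden @ W_age          # shape (4, 3)
gender_logits = hidden @ W_gender    # shape (4, 2)
```

In the second strategy, a gradient step on either task's loss would update the shared weights, which is how the correlation between tasks is learned from data rather than modelled explicitly.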
Standardisation: Arguably, the more mature and the closer to real-life application the field of Intelligent Audio Analysis gets, the greater the need for standardisation [84]. As before, standardisation efforts can be categorised along the signal processing chain. They include the definition of task modelling, such as given in the MPEG-4 standard for emotion in audio or the MIREX tasks and ID3 tag categories for music; the documentation and well-motivated grouping of audio features, such as the CEICES Feature Coding Scheme [102]; standardised feature sets, as provided by the openSMILE [72] and openEAR [73] toolkits or the MPEG-7 LLD standard; and machine learning frameworks [103]. Such standardised feature extraction and classification allows the feature extraction and classification components of a recognition system to be evaluated separately. To further increase the reproducibility and comparability of results, well-defined evaluation settings should be employed, such as the ones provided by the named challenge events [12]. Finally, communication between system components in real-life applications requires standardisation of recognition