outperforms common approaches in the field that do not provide this option.
The LSTM architectures shown are clearly suited in this respect, and whenever the application allows for modelling of temporal context dependencies, these or future alternatives should be considered, either on their own or in combination with other machine learning algorithms to provide them with dynamic warping abilities, as in the tandem DBN-BLSTM architectures shown.
Coupling of tasks [84]: A number of interdependencies are already visible in the tasks that were considered in this topic. With the addition of further or novel tasks, such dependencies are likely to be amplified. For example, in speaker analysis, long-term traits are coupled to some degree, e.g., height with age, gender, and race, as was shown by their interdependence in the determination of age, gender, and height in this topic.
Other examples include emotional manifestation being dependent on personality [96], and gender dependencies of non-linguistic vocalisations such as laughter [97].
In this topic, an example was also shown in the music domain for the interdependence of ballroom dance style, metre, and tempo. It seems obvious that the additionally introduced sound analysis tasks of event and evoked emotion are also interdependent.
Such knowledge can be integrated by keeping separate models conditioned on the other tasks, by adaptation or normalisation, or by considering additional information on related 'side tasks' [98, 99], as in the examples named above and shown in this topic. An alternative to such explicit modelling of dependencies is to learn them automatically from training data. For example, the rather simple strategy of using pairs of age and gender classes as the learning target, instead of each attribute individually, was shown to be beneficial in this topic and in [47]. In the future, enhanced modelling of multiple
correlated target variables should be a common aim in multi-task learning. The input features are then shared among tasks, for example via the internal activations in the hidden layer of a neural network [100]. A challenge may then arise from the different representation of task variables by various data types (continuous, ordinal, nominal), which may additionally operate on different time scales (e.g., dance style is mostly constant within a musical piece, whereas tempo may vary). By considering suitable methods for multi-scale fusion and multi-task learning [101], future Intelligent Audio Analysis should not focus on tasks in isolation, but aim at a 'more holistic' analysis of tasks.
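The two strategies discussed above, joint target classes and a shared representation with per-task output heads, can be sketched as follows. This is a minimal illustration only: the feature vectors, label values, and layer sizes are hypothetical, the network weights are random, and no training step is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical acoustic feature vectors: 4 examples, 6 features each
X = rng.normal(size=(4, 6))

# Strategy 1: pair two correlated attributes into one joint target class,
# so a single classifier can capture their dependency implicitly
ages = ["adult", "child", "adult", "senior"]
genders = ["f", "m", "m", "f"]
joint_targets = [f"{a}|{g}" for a, g in zip(ages, genders)]
print(joint_targets[0])  # "adult|f"

# Strategy 2: a multi-task network sharing one hidden layer, with a
# separate output head per task (weights are random, i.e. untrained)
W_shared = rng.normal(size=(6, 8))   # shared input-to-hidden weights
W_age = rng.normal(size=(8, 3))      # head for 3 age classes
W_gender = rng.normal(size=(8, 2))   # head for 2 gender classes

hidden = np.tanh(X @ W_shared)       # activations shared among both tasks
age_logits = hidden @ W_age          # shape (4, 3)
gender_logits = hidden @ W_gender    # shape (4, 2)
```

In the second strategy, a gradient step on either task's loss would update the shared weights, which is how the correlation between tasks is learned from data rather than modelled explicitly.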
Standardisation: Arguably, the more mature and the closer to real-life application the field of Intelligent Audio Analysis gets, the greater the need for standardisation [84]. As before, standardisation efforts can be categorised along the signal processing chain. They include the definition of task modelling, such as given in the MPEG-4 standard for emotion in audio or the MIREX tasks and ID3 tag categories for music; the documentation and well-motivated grouping of audio features, such as the CEICES Feature Coding Scheme [102]; standardised feature sets, as provided by the openSMILE [72] and openEAR [73] toolkits or the MPEG-7 LLD standard; and machine learning frameworks [103]. Such standardised feature extraction and classification allows the feature extraction and classification components of a recognition system to be evaluated separately. To further increase the reproducibility and comparability of results, well-defined evaluation settings should be employed, such as the ones provided by the named challenge events [12]. Finally, communication between system components in real-life applications requires standardisation of recognition