Intelligent Audio Analysis on which types of classifiers or regressors are used, partly owing to the diverse requirements arising from the variety of tasks (cf. Chap. 7).
Fusion (optional): This stage exists if information is fused on the 'late', semantic level rather than on the early feature level (cf., e.g., [11]); both variants are sketched below.
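As a minimal, illustrative sketch (function names and the two-class posteriors are invented here, not taken from the book): early fusion concatenates the streams' feature vectors before a single classifier, whereas late fusion combines the per-stream classifier outputs on the decision level.

```python
import numpy as np

def early_fusion(feat_audio, feat_text):
    """Early fusion: concatenate the streams' feature vectors and feed
    the result to a single classifier."""
    return np.concatenate([feat_audio, feat_text])

def late_fusion(post_audio, post_text, weights=(0.5, 0.5)):
    """Late fusion: combine per-class posteriors of independently
    trained classifiers on the decision level."""
    fused = (weights[0] * np.asarray(post_audio)
             + weights[1] * np.asarray(post_text))
    return fused / fused.sum()  # renormalise to a distribution

# Two streams disagree; the weighted combination settles the decision.
p_audio = [0.7, 0.3]  # hypothetical posteriors from the acoustic stream
p_text = [0.4, 0.6]   # hypothetical posteriors from the linguistic stream
print(late_fusion(p_audio, p_text))  # -> [0.55 0.45]
```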
Encoding (optional): Once the final decision is made, the information needs to be represented in a form suitable for system integration, e.g., into a music or sound search or a spoken language dialogue system [12]. Here, standards such as VoiceXML, the Extensible MultiModal Annotation markup language (EMMA) [13], the Emotion Markup Language (EmotionML) [14], the Multimodal Interaction Markup Language (MIML) [15], or ID3 tags may be employed to ensure utmost re-usability. Additional information such as confidences can reasonably be added to allow for disambiguation strategies or similar; a small EmotionML example follows below.
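As a sketch of such an encoding (the category label, confidence value, and category set are chosen purely for illustration; the namespace follows the W3C EmotionML 1.0 recommendation), a recognised emotion with its confidence could be serialised like this:

```python
import xml.etree.ElementTree as ET

# Minimal EmotionML 1.0 document: one recognised category plus a
# confidence value that a downstream system can use for disambiguation.
# The label "anger" and confidence 0.82 are invented for illustration.
root = ET.Element("emotionml", attrib={
    "xmlns": "http://www.w3.org/2009/10/emotionml",
    "category-set": "http://www.w3.org/TR/emotion-voc/xml#big6",
})
emotion = ET.SubElement(root, "emotion")
ET.SubElement(emotion, "category", name="anger", confidence="0.82")

# Prints the serialised document on a single line.
print(ET.tostring(root, encoding="unicode"))
```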
Audio databases: These comprise the stored audio of exemplary speech, music, and sound for model learning and evaluation. In addition, a transcription of the spoken content, of note events, etc., and/or the labelling of further target tasks may be given; a minimal data structure for one such exemplar is sketched below.
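As a minimal sketch of what one database entry might hold (field names, the path, and the labels are purely illustrative assumptions):

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class CorpusItem:
    """One exemplar in an audio database: the waveform location plus
    optional transcription and labels for further target tasks."""
    audio_path: str                      # stored audio (speech, music, sound)
    transcription: Optional[str] = None  # spoken content or note events
    labels: Dict[str, str] = field(default_factory=dict)

# A labelled speech exemplar for model learning and evaluation:
item = CorpusItem("wav/session01/utt_0042.wav",
                  transcription="could you repeat that please",
                  labels={"emotion": "neutral", "speaker": "f03"})
```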
Acoustic model (AM): Consists of the learnt dependencies between acoustic observations and classes, or continuous values in the case of regression; a toy classification example follows below.
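A deliberately simple stand-in for such a model (one diagonal Gaussian per class; real acoustic models, e.g., HMM- or neural-network-based ones, are far richer) illustrates the learnt observation-to-class dependency:

```python
import numpy as np

class DiagonalGaussianAM:
    """Toy acoustic model: one diagonal Gaussian per class over acoustic
    feature vectors; prediction picks the max log-likelihood class."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.mu_ = {c: X[y == c].mean(axis=0) for c in self.classes_}
        self.var_ = {c: X[y == c].var(axis=0) + 1e-6 for c in self.classes_}
        return self

    def predict(self, X):
        def loglik(x, c):
            return -0.5 * np.sum(np.log(2 * np.pi * self.var_[c])
                                 + (x - self.mu_[c]) ** 2 / self.var_[c])
        return np.array([max(self.classes_, key=lambda c: loglik(x, c))
                         for x in X])

# Two classes in a 2-D feature space (e.g., energy and pitch statistics):
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
y = np.array([0, 0, 1, 1])
print(DiagonalGaussianAM().fit(X, y).predict(X))  # -> [0 0 1 1]
```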
Language model (LM): Stores the learnt dependencies between linguistic observations and the corresponding assignments; a toy bigram example is given below.
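As an illustrative sketch (the training sentences are invented; a bigram model with add-one smoothing stands in for whatever LM is actually used):

```python
from collections import Counter

class BigramLM:
    """Toy bigram language model: stores word-pair counts and scores
    word transitions with add-one-smoothed conditional probabilities."""
    def fit(self, sentences):
        self.uni, self.bi = Counter(), Counter()
        for s in sentences:
            tokens = ["<s>"] + s.split() + ["</s>"]
            self.uni.update(tokens[:-1])
            self.bi.update(zip(tokens[:-1], tokens[1:]))
        self.vocab = len(self.uni)
        return self

    def prob(self, prev, word):
        # P(word | prev) with add-one smoothing
        return (self.bi[(prev, word)] + 1) / (self.uni[prev] + self.vocab)

lm = BigramLM().fit(["turn the music on", "turn the volume up"])
print(lm.prob("turn", "the"))  # high: "turn the" was seen twice
print(lm.prob("turn", "up"))   # low: unseen bigram
```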
In the following, all these steps (except for fusion and encoding) will be explained in detail in the remainder of Part II; practical applications are then shown in Part III.
References
1. Schuller, B.: Voice and speech analysis in search of states and traits. In: Salah, A.A., Gevers, T.
(eds.) Computer Analysis of Human Behavior, Advances in Pattern Recognition, chapter 9,
pp. 227-253. Springer, Berlin (2011)
2. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley, New York
(2001)
3. Cichocki, A., Zdunek, R., Phan, A.H., Amari, S.-I.: Nonnegative Matrix and Tensor Factorizations. Wiley, Chichester (2009)
4. Batliner, A., Seppi, D., Steidl, S., Schuller, B.: Segmenting into adequate units for automatic recognition of emotion-related episodes: a speech-based approach. Advances in Human-Computer Interaction, Special Issue on Emotion-Aware Natural Interaction, vol. 2010, Article ID 782802, pp. 1-15 (2010)
5. Pachet, F., Roy, P.: Analytical features: a knowledge-based approach to audio feature generation. EURASIP J. Audio Speech Music Process. 1, 1-23 (2009)
6. Schuller, B., Wimmer, M., Mösenlechner, L., Kern, C., Arsic, D., Rigoll, G.: Brute-forcing hierarchical functionals for paralinguistics: a waste of feature space? In: Proceedings of the 33rd IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2008, pp. 4501-4504. IEEE, Las Vegas (2008)
7. Eyben, F., Wöllmer, M., Schuller, B.: openSMILE—the Munich versatile and fast open-source audio feature extractor. In: Proceedings of the 18th ACM International Conference on Multimedia, MM 2010, pp. 1459-1462. ACM, Florence (2010)
8. Jolliffe, I.T.: Principal Component Analysis. Springer, Berlin (2002)
9. Pudil, P., Novovičová, J., Kittler, J.: Floating search methods in feature selection. Pattern Recogn. Lett. 15, 1119-1125 (1994)