Intelligent Audio Analysis on which types of classifiers or regressors are used, partly owing to the diverse requirements arising from the variety of tasks (cf. Chap. 7).
Fusion (optional): This stage exists if information is fused on the 'late', semantic level rather than on the early feature level (cf., e.g., [11]); both variants are sketched below.
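As a minimal, illustrative sketch (function names and the two-class posteriors are invented here, not taken from the book): early fusion concatenates the streams' feature vectors before a single classifier, whereas late fusion combines the per-stream classifier outputs on the decision level.

```python
import numpy as np

def early_fusion(feat_audio, feat_text):
    """Early fusion: concatenate the streams' feature vectors and feed
    the result to a single classifier."""
    return np.concatenate([feat_audio, feat_text])

def late_fusion(post_audio, post_text, weights=(0.5, 0.5)):
    """Late fusion: combine per-class posteriors of independently
    trained classifiers on the decision level."""
    fused = (weights[0] * np.asarray(post_audio)
             + weights[1] * np.asarray(post_text))
    return fused / fused.sum()  # renormalise to a distribution

# Two streams disagree; the weighted combination settles the decision.
p_audio = [0.7, 0.3]  # hypothetical posteriors from the acoustic stream
p_text = [0.4, 0.6]   # hypothetical posteriors from the linguistic stream
print(late_fusion(p_audio, p_text))  # -> [0.55 0.45]
```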
Encoding (optional): Once the final decision is made, the information needs to be represented in a form suitable for system integration, e.g., into a music or sound search or a spoken language dialogue system [12]. Here, standards such as VoiceXML, the Extensible MultiModal Annotation markup language (EMMA) [13], the Emotion Markup Language (EmotionML) [14], the Multimodal Interaction Markup Language (MIML) [15], or ID3 tags may be employed to ensure utmost re-usability. Additional information such as confidences can reasonably be added to allow for disambiguation strategies or similar; a small EmotionML example follows below.
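As a sketch of such an encoding (the category label, confidence value, and category set are chosen purely for illustration; the namespace follows the W3C EmotionML 1.0 recommendation), a recognised emotion with its confidence could be serialised like this:

```python
import xml.etree.ElementTree as ET

# Minimal EmotionML 1.0 document: one recognised category plus a
# confidence value that a downstream system can use for disambiguation.
# The label "anger" and confidence 0.82 are invented for illustration.
root = ET.Element("emotionml", attrib={
    "xmlns": "http://www.w3.org/2009/10/emotionml",
    "category-set": "http://www.w3.org/TR/emotion-voc/xml#big6",
})
emotion = ET.SubElement(root, "emotion")
ET.SubElement(emotion, "category", name="anger", confidence="0.82")

# Prints the serialised document on a single line.
print(ET.tostring(root, encoding="unicode"))
```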
Audio databases: These comprise the stored audio of exemplary speech, music, and sound for model learning and evaluation. In addition, a transcription of the spoken content, of note events, etc., and/or the labelling of further target tasks may be given; a minimal data structure for one such exemplar is sketched below.
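As a minimal sketch of what one database entry might hold (field names, the path, and the labels are purely illustrative assumptions):

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class CorpusItem:
    """One exemplar in an audio database: the waveform location plus
    optional transcription and labels for further target tasks."""
    audio_path: str                      # stored audio (speech, music, sound)
    transcription: Optional[str] = None  # spoken content or note events
    labels: Dict[str, str] = field(default_factory=dict)

# A labelled speech exemplar for model learning and evaluation:
item = CorpusItem("wav/session01/utt_0042.wav",
                  transcription="could you repeat that please",
                  labels={"emotion": "neutral", "speaker": "f03"})
```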
Acoustic model (AM): Consists of the learnt dependencies between acoustic observations and classes, or continuous values in the case of regression; a toy classification example follows below.
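A deliberately simple stand-in for such a model (one diagonal Gaussian per class; real acoustic models, e.g., HMM- or neural-network-based ones, are far richer) illustrates the learnt observation-to-class dependency:

```python
import numpy as np

class DiagonalGaussianAM:
    """Toy acoustic model: one diagonal Gaussian per class over acoustic
    feature vectors; prediction picks the max log-likelihood class."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.mu_ = {c: X[y == c].mean(axis=0) for c in self.classes_}
        self.var_ = {c: X[y == c].var(axis=0) + 1e-6 for c in self.classes_}
        return self

    def predict(self, X):
        def loglik(x, c):
            return -0.5 * np.sum(np.log(2 * np.pi * self.var_[c])
                                 + (x - self.mu_[c]) ** 2 / self.var_[c])
        return np.array([max(self.classes_, key=lambda c: loglik(x, c))
                         for x in X])

# Two classes in a 2-D feature space (e.g., energy and pitch statistics):
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
y = np.array([0, 0, 1, 1])
print(DiagonalGaussianAM().fit(X, y).predict(X))  # -> [0 0 1 1]
```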
Language model (LM): Stores the learnt dependencies between linguistic observations and the corresponding assignments; a toy bigram example is given below.
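As an illustrative sketch (the training sentences are invented; a bigram model with add-one smoothing stands in for whatever LM is actually used):

```python
from collections import Counter

class BigramLM:
    """Toy bigram language model: stores word-pair counts and scores
    word transitions with add-one-smoothed conditional probabilities."""
    def fit(self, sentences):
        self.uni, self.bi = Counter(), Counter()
        for s in sentences:
            tokens = ["<s>"] + s.split() + ["</s>"]
            self.uni.update(tokens[:-1])
            self.bi.update(zip(tokens[:-1], tokens[1:]))
        self.vocab = len(self.uni)
        return self

    def prob(self, prev, word):
        # P(word | prev) with add-one smoothing
        return (self.bi[(prev, word)] + 1) / (self.uni[prev] + self.vocab)

lm = BigramLM().fit(["turn the music on", "turn the volume up"])
print(lm.prob("turn", "the"))  # high: "turn the" was seen twice
print(lm.prob("turn", "up"))   # low: unseen bigram
```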
In the following, all these steps (except for fusion and encoding) will be explained in detail in the remainder of Part II; practical applications are then shown in Part III.
References
1. Schuller, B.: Voice and speech analysis in search of states and traits. In: Salah, A.A., Gevers, T.
(eds.) Computer Analysis of Human Behavior, Advances in Pattern Recognition, chapter 9,
pp. 227-253. Springer, Berlin (2011)
2. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley, New York
(2001)
3. Cichocki, A., Zdunek, R., Phan, A.H., Amari, S.-I.: Nonnegative Matrix and Tensor Factorizations. Wiley, Chichester (2009)
4. Batliner, A., Seppi, D., Steidl, S., Schuller, B.: Segmenting into adequate units for automatic recognition of emotion-related episodes: a speech-based approach. Advances in Human-Computer Interaction, Special Issue on Emotion-Aware Natural Interaction, vol. 2010, Article ID 782802, pp. 1-15 (2010)
5. Pachet, F., Roy, P.: Analytical features: a knowledge-based approach to audio feature generation. EURASIP J. Audio Speech Music Process. 1, 1-23 (2009)
6. Schuller, B., Wimmer, M., Mösenlechner, L., Kern, C., Arsic, D., Rigoll, G.: Brute-forcing hierarchical functionals for paralinguistics: a waste of feature space? In: Proceedings of the 33rd IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2008, pp. 4501-4504. IEEE, Las Vegas (2008)
7. Eyben, F., Wöllmer, M., Schuller, B.: openSMILE—the Munich versatile and fast open-source audio feature extractor. In: Proceedings of the 18th ACM International Conference on Multimedia, MM 2010, pp. 1459-1462. ACM, Florence (2010)
8. Jolliffe, I.T.: Principal Component Analysis. Springer, Berlin (2002)
9. Pudil, P., Novovičová, J., Kittler, J.: Floating search methods in feature selection. Pattern Recogn. Lett. 15, 1119-1125 (1994)