Chain of Audio Processing - Intelligent Audio Analysis


Chapter 4

Chain of Audio Processing

A complex system that works is invariably found to have evolved from a simple system that works.

- John Gall

In the following, we look at the overall process of Intelligent Audio Analysis
as introduced in [1].

Fig. 4.1 gives a unified overview of a typical Intelligent Audio Analysis system.
The remainder of this chapter follows its chain of processing and describes each
component in detail.

Preprocessing: After the audio has been captured by a single microphone or an
array of microphones and digitised, it is preprocessed. This step usually aims at
enhancing the audio signal of interest or at (blind) separation of the individual
sources that are mixed in the captured audio stream. De-noising is dealt with in
the literature more frequently than de-reverberation, which aims at reducing the
influence of varying room impulse responses. Popular (blind) source separation
methods comprise Independent Component Analysis (ICA) [2] in the case of multiple
microphones/arrays, and Non-Negative Matrix Factorisation (NMF) [3] in the case
of single microphones (cf. Chap. 8). Popular audio enhancement algorithms include
Wiener filtering and unsupervised spectral subtraction (cf. Chap. 9).
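To make the enhancement step concrete, the following is a minimal sketch of magnitude spectral subtraction in Python with NumPy. It is an illustration under stated assumptions, not the method detailed in Chap. 9: the function names, the 25 ms frame and 10 ms hop at 16 kHz, the spectral floor, and the assumption that the first frames contain noise only are all choices made for this example.

```python
import numpy as np

def stft_magnitude(signal, frame_len=400, hop=160):
    """Magnitude spectrogram (frames x bins) from Hann-windowed frames."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def spectral_subtraction(mag, n_noise_frames=10, floor=0.01):
    """Subtract an averaged noise magnitude estimate from every frame.

    Assumption of this sketch: the first n_noise_frames frames hold
    noise only, so their mean magnitude serves as the noise estimate.
    """
    noise_est = mag[:n_noise_frames].mean(axis=0)
    cleaned = mag - noise_est
    # A spectral floor keeps the result non-negative.
    return np.maximum(cleaned, floor * mag)

# Synthetic test signal: a 440 Hz tone that starts after 0.25 s,
# embedded in white noise (all values are illustrative).
rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
tone[: sr // 4] = 0.0
noisy = tone + 0.05 * rng.standard_normal(sr)

mag = stft_magnitude(noisy)
enhanced = spectral_subtraction(mag)
```

The sketch operates on magnitudes only. Resynthesising a waveform would additionally require the noisy phase and an overlap-add step, which are omitted here.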

Low-Level Descriptor extraction: After the components of interest of the digital
signal have been extracted, parameters must be extracted from the signal that
contain (ideally only) the information relevant to a given analysis task and
discard other information. Examples of such parameters are the signal energy and
the pitch. The term 'features' is also used instead of 'parameters'. Audio analysis
is mostly based on short-time analysis, i.e., analysis of short frames of audio
within which the signal can be assumed to be stationary; the set of parameters
extracted at this stage is therefore called the Low-Level Descriptors (LLDs).
This is detailed in Sect. 6.1.2.
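As an illustration of short-time analysis, the sketch below cuts a signal into overlapping frames and computes two LLDs per frame, the log energy and a pitch estimate. Everything specific here is an assumption of the example rather than something the text prescribes: the function names, the 25 ms frame length and 10 ms hop, the 60-400 Hz pitch search range, and the use of a plain autocorrelation peak as the pitch estimator.

```python
import numpy as np

def frame_signal(signal, sr, frame_ms=25.0, hop_ms=10.0):
    """Cut a 1-D signal into overlapping frames (n_frames x frame_len)."""
    frame_len = int(round(sr * frame_ms / 1000))
    hop = int(round(sr * hop_ms / 1000))
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

def log_energy(frames, eps=1e-10):
    """Natural log of the summed squared samples of each frame."""
    return np.log(np.sum(frames ** 2, axis=1) + eps)

def autocorr_pitch(frame, sr, fmin=60.0, fmax=400.0):
    """Pitch in Hz from the strongest autocorrelation peak in [fmin, fmax]."""
    frame = frame - frame.mean()
    # Keep non-negative lags only.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sr / fmax)
    lag_max = int(sr / fmin)
    lag = lag_min + np.argmax(ac[lag_min:lag_max + 1])
    return sr / lag

# Synthetic example: one second of a 200 Hz sine at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
signal = 0.5 * np.sin(2 * np.pi * 200 * t)

frames = frame_signal(signal, sr)
# One LLD vector per frame: [log energy, pitch in Hz].
lld = np.column_stack([log_energy(frames),
                       [autocorr_pitch(f, sr) for f in frames]])
```

Each row of `lld` is one feature vector. A real system would add voicing detection and further descriptors, which this sketch leaves out.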

LLDs are extracted at approximately 100 frames per second with typical frame
sizes of 10-30 ms. Typically, multiple LLDs are extracted per frame; in this case
we refer to an LLD (feature) vector. Windowing functions are usually rectangular
