Digital Signal Processing Reference
Chapter 4
Chain of Audio Processing
A complex system that works is invariably found to have evolved
from a simple system that works.
John Gall
In the following, we look at the overall process of Intelligent Audio Analysis
as introduced in [1].
Figure 4.1 gives a unified overview of a typical Intelligent Audio Analysis
system. Its chain of processing is followed below, and each component is
described in detail.
Preprocessing: After the audio has been captured by a single microphone or an
array of microphones and digitised, it is preprocessed. This step usually
aims at enhancing the audio signal of interest or at (blind) separation of the
individual sources that are mixed in the captured audio stream. In the literature,
de-noising is dealt with more frequently than de-reverberation, which aims at
reducing the influence of varying room impulse responses. Popular (blind) source
separation methods include Independent Component Analysis (ICA) [2] in the case of
multiple microphones or arrays, and Non-Negative Matrix Factorisation (NMF) [3] in
the case of single microphones (cf. Chap. 8). Popular audio enhancement algorithms
include Wiener filtering and unsupervised spectral subtraction (cf. Chap. 9).
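The two enhancement ideas named above can be sketched per frequency bin. The following is only a minimal illustration, not the method of Chap. 9: it assumes that the magnitude spectrum of one noisy frame and a noise magnitude estimate (e.g., averaged over a non-speech segment) are already available. The function names and the parameters alpha and floor are hypothetical choices made for this example.

```python
def spectral_subtraction(noisy_mag, noise_mag, alpha=1.0, floor=0.01):
    """Subtract a noise magnitude estimate from one frame's magnitude spectrum.

    noisy_mag, noise_mag: per-bin magnitudes of equal length.
    alpha: over-subtraction factor (assumed default, not from the text).
    floor: fraction of the noisy magnitude kept as a lower bound,
           so that no bin becomes negative.
    """
    if len(noisy_mag) != len(noise_mag):
        raise ValueError("spectra must have the same number of bins")
    return [max(y - alpha * n, floor * y) for y, n in zip(noisy_mag, noise_mag)]


def wiener_gain(noisy_mag, noise_mag):
    """Per-bin Wiener-style gain: estimated clean power over noisy power.

    The clean power is approximated as noisy power minus noise power,
    clipped at zero; this is a simplification for illustration.
    """
    gains = []
    for y, n in zip(noisy_mag, noise_mag):
        noisy_pow = y * y
        clean_pow = max(noisy_pow - n * n, 0.0)
        gains.append(clean_pow / noisy_pow if noisy_pow > 0.0 else 0.0)
    return gains
```

In both cases the enhanced magnitude would be recombined with the noisy phase before transforming back to the time domain; for the Wiener variant, each noisy bin is simply multiplied by its gain.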
Low Level Descriptor extraction: After the components of interest have been
extracted from the digital signal, parameters must be extracted that contain,
ideally only, the information relevant to a given analysis task and discard other
information. Such parameters are, e.g., the signal energy and the pitch. Instead of
'parameters', the term 'features' is also used. Since audio analysis is mostly based
on short-time analysis, i.e., analysis of short frames of audio within which the
signal can be assumed to be stationary, the specific set of parameters extracted at
this stage is called the Low-Level Descriptors (LLDs). This is detailed in Sect. 6.1.2.
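As a rough sketch of short-time analysis, the following code cuts a signal into overlapping frames and computes the two example LLDs, energy and pitch, for each frame. The frame length of 25 ms, the frame shift of 10 ms, the pitch search range of 50-500 Hz, and all function names are assumptions made for this illustration. The pitch estimate uses a plain autocorrelation peak search, which is one simple option among many and not necessarily the method of Sect. 6.1.2.

```python
def frame_signal(samples, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Cut a signal into overlapping frames (rectangular window, no tapering)."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def frame_pitch(frame, sample_rate, f_min=50.0, f_max=500.0):
    """Estimate the fundamental frequency in Hz from the autocorrelation peak.

    Returns 0.0 if no positive peak is found (e.g., for a silent frame).
    """
    lag_min = max(1, int(sample_rate / f_max))
    lag_max = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag, best_val = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        val = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return sample_rate / best_lag if best_lag else 0.0


def extract_llds(samples, sample_rate):
    """Return one (energy, pitch) LLD vector per frame."""
    return [(frame_energy(f), frame_pitch(f, sample_rate))
            for f in frame_signal(samples, sample_rate)]
```

Calling extract_llds on a list of samples yields one small feature vector per frame; a real system would add many more descriptors and a voicing decision before trusting the pitch value.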
LLDs are extracted at approximately 100 frames per second with typical frame
sizes of 10-30 ms. Typically, multiple LLDs are extracted per frame; we refer to
an LLD (feature) vector in this case. Windowing functions are usually rectangular
 