Fig. 4.1 Unified overview of typical Intelligent Audio Analysis systems. Dotted boxes indicate
optional components. Dashed lines indicate steps carried out only during system training or
adaptation phases. s(k), x, and y denote the audio signal, feature vector, and target (vector),
respectively; primes indicate altered versions and subscripts indicate diverse vectors. The
connection from classification or regression back to the audio database indicates active and
semi-supervised or unsupervised learning. The fusion block allows for integration of other
signals by late 'semantic' fusion.
The windowing function is typically rectangular for extraction of a Low Level Descriptor
(LLD) in the time domain and smooth (e.g., Hamming or Hann) for extraction in the frequency
or time-frequency (TF, e.g., Gaussian or general wavelets) domain. To compensate for
artefacts introduced by the windowing function, the LLD contour is typically smoothed with a
moving-average filter of three frames' length.
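For illustration, a minimal sketch of such smoothing, assuming NumPy; the function and the
example contour are our own and not taken from the text:

    import numpy as np

    def moving_average(lld, width=3):
        # Smooth a 1-D LLD contour with a moving-average filter of
        # `width` frames (three, as above); 'same' mode keeps the length.
        kernel = np.ones(width) / width
        return np.convolve(lld, kernel, mode="same")

    lld = np.array([0.1, 0.9, 0.2, 0.8, 0.3])   # hypothetical per-frame energy
    smoothed = moving_average(lld)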
Many systems process features directly on the LLD level (also referred to as the frame
level), either to provide a frame-by-frame estimate, by sliding windows of feature vectors
of fixed length, or by dynamic approaches that provide some form of temporal alignment and
warping, such as Hidden Markov Models (HMMs) or general Dynamic Bayesian Networks (DBNs).
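The fixed-length sliding-window variant can be sketched as follows (window and hop sizes
and all names are illustrative, not from the text): stacking consecutive frame-level
vectors yields one fixed-size input per window for a static classifier or regressor.

    import numpy as np

    def stack_windows(features, win=10, hop=5):
        # features: (frames x dims) matrix of frame-level LLDs.
        # Returns one flattened feature vector per sliding window.
        starts = range(0, len(features) - win + 1, hop)
        return np.stack([features[s:s + win].ravel() for s in starts])

    X = np.random.rand(100, 13)      # e.g., 100 frames of 13 cepstral coefficients
    print(stack_windows(X).shape)    # (19, 130): 19 windows of 10 x 13 values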
Typical audio LLDs cover: intonation (pitch, etc.), intensity (energy, etc.), Linear
Prediction Cepstral Coefficients (LPCCs), Perceptual Linear Prediction (PLP) coefficients,
cepstral coefficients (MFCCs, etc.), formants (amplitude, position, width, etc.), spectrum
(Mel Frequency Bands (MFBs), NMF-based components, MPEG-7 audio descriptors, roll-off,
etc.), harmonicity (Harmonics-to-Noise Ratio (HNR), Noise-to-Harmonics Ratio (NHR), etc.),
perturbation (jitter, shimmer, etc.), and pitch class profiles.
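Several of these LLDs can be extracted, for example, with the librosa library; a sketch
under the assumption of a 16 kHz mono recording, where the file name is hypothetical:

    import librosa

    y, sr = librosa.load("example.wav", sr=16000)           # hypothetical file

    mfcc    = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # cepstral coefficients
    energy  = librosa.feature.rms(y=y)                      # intensity
    f0      = librosa.yin(y, fmin=50, fmax=500, sr=sr)      # intonation (pitch)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)  # spectral roll-off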
Note that one can also introduce string-type LLDs to describe, e.g., linguistic content.
Their extraction usually requires chunking and speech recognition or similar techniques.
Chunking: In most applications, intelligent audio analysis algorithms have to consider
longer segments of audio, as attributes such as emotion, speaking style, music mood,
instruments, musical chord progression, or general sound events are characterised by the
dynamics of the signal over time. Depending on the task, the right segment of analysis,
i.e., the chunking, has to be found. Methods for chunking comprise: choosing a fixed number
of frames, acoustic chunking (e.g., by the Bayesian Information Criterion), voiced/unvoiced
parts, and, for speech, units such as phonemes, syllables, words, or sub-turns in the sense
of syntactically or semantically motivated chunkings (a minimal sketch of the first method
follows).
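The simplest of these methods, choosing a fixed number of frames, might look as follows;
a sketch assuming NumPy, with chunk length and names chosen for illustration only:

    import numpy as np

    def chunk_fixed(features, frames_per_chunk=100):
        # Split a (frames x dims) feature matrix into consecutive
        # fixed-length chunks; a trailing remainder is dropped.
        n = (len(features) // frames_per_chunk) * frames_per_chunk
        return np.split(features[:n], n // frames_per_chunk)

    X = np.random.rand(523, 13)
    chunks = chunk_fixed(X)          # five chunks of 100 frames each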
 