Fig. 4.1 Unified overview of typical Intelligent Audio Analysis systems. Dotted boxes indicate
optional components. Dashed lines indicate steps carried out only during system training or
adaptation phases. s(k), x, and y denote the audio signal, feature vector, and target (vector),
respectively; primes indicate altered versions and subscripts indicate diverse vectors. The
connection from classification or regression back to the audio database indicates active and
semi-supervised or unsupervised learning. The fusion block allows for integration of other
signals by late 'semantic' fusion.
The windowing function is typically rectangular for extraction of a Low Level Descriptor
(LLD) in the time domain and smooth (e.g., Hamming or Hann) for extraction in the frequency
or time-frequency (TF, e.g., Gaussian or general wavelets) domain. To compensate for
artefacts introduced by the windowing function, the LLD contour is typically smoothed with a
moving-average filter of three frames' length.
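For illustration, a minimal sketch of such smoothing, assuming NumPy; the function and the
example contour are our own and not taken from the text:

    import numpy as np

    def moving_average(lld, width=3):
        # Smooth a 1-D LLD contour with a moving-average filter of
        # `width` frames (three, as above); 'same' mode keeps the length.
        kernel = np.ones(width) / width
        return np.convolve(lld, kernel, mode="same")

    lld = np.array([0.1, 0.9, 0.2, 0.8, 0.3])   # hypothetical per-frame energy
    smoothed = moving_average(lld)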
Many systems process features directly on the LLD level (also referred to as the frame
level), either to provide a frame-by-frame estimate, by sliding windows of feature vectors
of fixed length, or by dynamic approaches that provide some form of temporal alignment and
warping, such as Hidden Markov Models (HMMs) or general Dynamic Bayesian Networks (DBNs).
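The fixed-length sliding-window variant can be sketched as follows (window and hop sizes
and all names are illustrative, not from the text): stacking consecutive frame-level
vectors yields one fixed-size input per window for a static classifier or regressor.

    import numpy as np

    def stack_windows(features, win=10, hop=5):
        # features: (frames x dims) matrix of frame-level LLDs.
        # Returns one flattened feature vector per sliding window.
        starts = range(0, len(features) - win + 1, hop)
        return np.stack([features[s:s + win].ravel() for s in starts])

    X = np.random.rand(100, 13)      # e.g., 100 frames of 13 cepstral coefficients
    print(stack_windows(X).shape)    # (19, 130): 19 windows of 10 x 13 values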
Typical audio LLDs cover: intonation (pitch, etc.), intensity (energy, etc.), Linear
Prediction Cepstral Coefficients (LPCCs), Perceptual Linear Prediction (PLP) coefficients,
cepstral coefficients (MFCCs, etc.), formants (amplitude, position, width, etc.), spectrum
(Mel Frequency Bands (MFBs), NMF-based components, MPEG-7 audio descriptors, roll-off,
etc.), harmonicity (Harmonics-to-Noise Ratio (HNR), Noise-to-Harmonics Ratio (NHR), etc.),
perturbation (jitter, shimmer, etc.), and pitch class profiles.
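Several of these LLDs can be extracted, for example, with the librosa library; a sketch
under the assumption of a 16 kHz mono recording, where the file name is hypothetical:

    import librosa

    y, sr = librosa.load("example.wav", sr=16000)           # hypothetical file

    mfcc    = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # cepstral coefficients
    energy  = librosa.feature.rms(y=y)                      # intensity
    f0      = librosa.yin(y, fmin=50, fmax=500, sr=sr)      # intonation (pitch)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)  # spectral roll-off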
Note that one can also introduce string-type LLDs to describe, e.g., linguistic content.
Their extraction usually requires chunking and speech recognition or similar techniques.
Chunking: In most applications, intelligent audio analysis algorithms have to consider
longer segments of audio, as attributes such as emotion, speaking style, music mood,
instruments, musical chord progression, or general sound events are characterised by the
dynamics of the signal over time. Depending on the task, the right segment of analysis,
i.e., the chunking, has to be found. Methods for chunking comprise: choosing a fixed number
of frames, acoustic chunking (e.g., by the Bayesian Information Criterion), voiced/unvoiced
parts, and, for speech, units such as phonemes, syllables, words, or sub-turns in the sense
of syntactically or semantically motivated chunkings (a minimal sketch of the first method
follows).
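The simplest of these methods, choosing a fixed number of frames, might look as follows;
a sketch assuming NumPy, with chunk length and names chosen for illustration only:

    import numpy as np

    def chunk_fixed(features, frames_per_chunk=100):
        # Split a (frames x dims) feature matrix into consecutive
        # fixed-length chunks; a trailing remainder is dropped.
        n = (len(features) // frames_per_chunk) * frames_per_chunk
        return np.split(features[:n], n // frames_per_chunk)

    X = np.random.rand(523, 13)
    chunks = chunk_fixed(X)          # five chunks of 100 frames each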
 