Audio Enhancement and Robustness - Intelligent Audio Analysis


such audio. However, noise frames are assumed to be independent over time. As

a consequence, non-stationary noises are not modelled adequately. Even with the

restrictions made in the GPB algorithm, feature enhancement by SLDM is computationally

more demanding than the techniques discussed above. Further, as in the

AFE (cf. Sect. 9.1), accurate audio activity detection is required to provide correct

estimation of the noise LDM.

9.3 Model Architectures

The most frequently used data-driven model representations of audio are HMMs

[14]. Beyond the optimisation options described so far along the chain of Intelligent

Audio Analysis, extending HMM topologies to more general DBN layouts can also

help to increase noise robustness [15, 17, 44]. Generative models such as HMMs

assume conditional independence of the audio feature observations, thus ignoring

the long-range dependencies present in most audio of interest [45]. To overcome this,

Conditional Random Fields (CRF) [46-48] model a sequence by an exponential

distribution given the observation sequence. The HCRF [15, 49] further includes

hidden state sequences for the estimation of the conditional probability of a class

over an entire sequence. Another interesting option is to model the raw audio signal

in the time domain [16]. For example, SAR-HMMs [16] provide good results in clean

audio conditions. To cope with noise, these can be extended to a Switching Linear

Dynamical System (SLDS) [17] to model the dynamics of the raw audio signal and

the noise. These alternatives are briefly presented next.
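
For orientation, the three model families contrasted above can be written side by side. The following is a sketch in common textbook notation, not notation taken from this chapter: the symbols x (observation sequence), s (HMM state sequence), y (label sequence), h (hidden state sequence), c (class), the feature functions f_k with weights \lambda_k, and the potential \Psi are all assumptions introduced here for illustration.

```latex
% HMM (generative): given its state s_t, each observation x_t is
% independent of all other observations; p(s_1 | s_0) denotes the
% initial state distribution.
p(\mathbf{x}, \mathbf{s}) = \prod_{t=1}^{T} p(s_t \mid s_{t-1}) \, p(x_t \mid s_t)

% Linear-chain CRF (discriminative): exponential distribution over label
% sequences, conditioned on the whole observation sequence.
p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
  \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'}
  \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)

% HCRF: one class c for the entire sequence, with the hidden state
% sequence h summed out.
p(c \mid \mathbf{x}) =
  \frac{\sum_{\mathbf{h}} \exp \Psi(c, \mathbf{h}, \mathbf{x}; \lambda)}
       {\sum_{c'} \sum_{\mathbf{h}} \exp \Psi(c', \mathbf{h}, \mathbf{x}; \lambda)}
```

Because each CRF feature function may inspect the complete observation sequence x, dependencies between distant frames can enter the model, which the HMM factorisation rules out.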

9.3.1 Conditional Random Fields

As mentioned above, CRF [46] use an exponential distribution to model a sequence

given its observations and can thereby also capture non-local dependencies among states and

observations. Further, unnormalised transition probabilities are possible. Owing to

the ability to enforce a Markov assumption as in HMMs, dynamic programming is

applicable for inference. CRFs were also shown to be beneficial as LM [50].
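
To make the role of dynamic programming concrete, the following is a minimal sketch of inference in a linear-chain CRF under a first-order Markov assumption. It is an illustration only, not the implementation used in the cited work. The function names and the score tables emit and trans are hypothetical: emit[t][j] is assumed to hold an unnormalised log-score for label j at frame t (already computed from the observations), and trans[i][j] an unnormalised log-score for moving from label i to label j.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(labels, emit, trans):
    """Unnormalised log-score of one complete label sequence."""
    score = emit[0][labels[0]]
    for t in range(1, len(labels)):
        score += trans[labels[t - 1]][labels[t]] + emit[t][labels[t]]
    return score


def log_partition(emit, trans):
    """Forward recursion: log Z(x), summing over all label sequences."""
    n_labels = len(trans)
    alpha = list(emit[0])
    for t in range(1, len(emit)):
        alpha = [
            emit[t][j]
            + logsumexp([alpha[i] + trans[i][j] for i in range(n_labels)])
            for j in range(n_labels)
        ]
    return logsumexp(alpha)


def viterbi(emit, trans):
    """Best label sequence and its unnormalised log-score."""
    n_labels = len(trans)
    delta = list(emit[0])
    backpointers = []
    for t in range(1, len(emit)):
        new_delta, pointers = [], []
        for j in range(n_labels):
            # Best predecessor label for label j at frame t.
            best_i = max(range(n_labels), key=lambda i: delta[i] + trans[i][j])
            pointers.append(best_i)
            new_delta.append(delta[best_i] + trans[best_i][j] + emit[t][j])
        delta = new_delta
        backpointers.append(pointers)
    last = max(range(n_labels), key=lambda j: delta[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, delta[last]


# Toy example: 3 frames, 2 labels; all scores are made up for illustration.
emit_scores = [[1.0, 0.2], [0.0, 0.5], [0.2, 1.5]]
trans_scores = [[0.5, -0.5], [-0.5, 0.5]]

best_path, best_score = viterbi(emit_scores, trans_scores)
log_z = log_partition(emit_scores, trans_scores)
# Conditional log-probability of the best path: score minus log Z(x).
print(best_path, best_score - log_z)
```

The transition scores need not sum to one, since normalisation happens once per sequence through log Z(x). Both recursions cost on the order of T times the squared number of labels, rather than enumerating every label sequence.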

9.3.2 Hidden Conditional Random Fields

An extension to HCRF is needed to make the CRF paradigm suited for general audio

recognition tasks. This is because CRF provide a class prediction for each observation

frame of a time sequence rather than for the entire sequence. HCRF overcome

this by adding hidden state sequences [49]. Reports of superiority over HMM in

the Intelligent Audio Analysis domain include the recognition of phones [15] and
