Audio Enhancement and Robustness - Intelligent Audio Analysis - page 157


Fig. 9.7 AR-SLDS as DBN structure (nodes over time steps $t-3$ to $t$: switch states $s$, hidden vectors $h$, observed samples $v$)

in the case of the SAR-HMM, $\eta_t$ is not modelling an independent additive noise source. To determine the observed sample at time step $t$, the vector $h_t$ is projected onto a scalar $v_t$:

$$v_t = B h_t + \eta_t, \quad \text{with} \quad \eta_t \sim \mathcal{N}(\eta_t; 0, \sigma_V^2), \tag{9.29}$$

where $\eta_t$ models independent additive white Gaussian noise (AWGN) assumed to modify the hidden clean sample $B h_t$. The DBN structure of the SLDS that models the hidden clean signal and an independent additive noise is shown in Fig. 9.7.
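To make Eq. (9.29) concrete, the following Python sketch draws a sequence from an AR-SLDS. It is a minimal illustration under stated assumptions, not a reference implementation: the hidden dynamics $h_t = A(s_t) h_{t-1} + \eta^H_t$ with $\eta^H_t \sim \mathcal{N}(0, \Sigma_H(s_t))$ are assumed from the model definition preceding this page, and the function name `simulate_ar_slds`, the initial switch distribution `p0` and the switch transition matrix `trans` are hypothetical. Only the final step, adding AWGN with standard deviation $\sigma_V$ to the clean sample $B h_t$, is Eq. (9.29) itself.

```python
import numpy as np


def simulate_ar_slds(p0, trans, A, Sigma_H, B, sigma_V, T, rng):
    """Draw switch states s, hidden vectors h and observed samples v from an AR-SLDS.

    p0:      (S,) initial switch-state distribution (hypothetical)
    trans:   (S, S) switch transitions, trans[i, j] = p(s_t = j | s_{t-1} = i) (hypothetical)
    A:       (S, R, R) state-dependent transition matrices A(s)
    Sigma_H: (S, R, R) state-dependent hidden-noise covariances Sigma_H(s)
    B:       (R,) projection of the hidden vector onto a scalar sample
    sigma_V: standard deviation of the AWGN eta_t in Eq. (9.29)
    """
    S, R, _ = A.shape
    s = np.empty(T, dtype=int)
    h = np.empty((T, R))
    v = np.empty(T)
    h_prev = np.zeros(R)  # assumed zero initial hidden state
    for t in range(T):
        probs = p0 if t == 0 else trans[s[t - 1]]
        s[t] = rng.choice(S, p=probs)
        # Assumed hidden dynamics: h_t = A(s_t) h_{t-1} + N(0, Sigma_H(s_t))
        h[t] = A[s[t]] @ h_prev + rng.multivariate_normal(np.zeros(R), Sigma_H[s[t]])
        # Eq. (9.29): observed sample = clean sample B h_t + independent AWGN
        v[t] = B @ h[t] + sigma_V * rng.standard_normal()
        h_prev = h[t]
    return s, h, v


# Usage: two switch states, second-order hidden dynamics (all values illustrative)
rng = np.random.default_rng(0)
A = np.array([[[0.5, -0.2], [1.0, 0.0]],
              [[1.2, -0.5], [1.0, 0.0]]])
Sigma_H = np.array([np.diag([0.1, 0.0]), np.diag([0.3, 0.0])])
B = np.array([1.0, 0.0])
trans = np.array([[0.99, 0.01], [0.02, 0.98]])
p0 = np.array([0.5, 0.5])
s, h, v = simulate_ar_slds(p0, trans, A, Sigma_H, B, sigma_V=0.05, T=2000, rng=rng)
```

With `sigma_V = 0` the observation equals the clean sample $B h_t$; for `sigma_V > 0` the residual `v - h @ B` is pure AWGN, which is exactly the part of the model that absorbs white noise disturbance.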

The parameters $A(s_t)$, $B$ and $\Sigma_H(s_t)$ of the SLDS can be chosen to mimic the SAR-HMM (cf. Sect. 9.3.3.1) for the case $\sigma_V = 0$ [17]. Likewise, if $\sigma_V \neq 0$, a noise model is included, but no training of a new model is needed. As exact inference in the AR-SLDS has a complexity of $O(S^T)$, where $S$ is the number of switch states and $T$ the number of time steps, the Expectation Correction (EC) approximation [54] provides an elegant reduction to $O(T)$.
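The statement that an SLDS can mimic the SAR-HMM for $\sigma_V = 0$ can be checked numerically. The sketch below assumes the SAR-HMM recursion $v_t = \sum_{r=1}^{R} a_r(s_t)\, v_{t-r} + \eta_t$ with $\eta_t \sim \mathcal{N}(0, \sigma^2(s_t))$; the sign convention and the names `a` and `sigma2` are assumptions, since Sect. 9.3.3.1 is not reproduced here. It uses the common companion-form construction: $h_t$ stacks the last $R$ samples, $A(s)$ holds the AR coefficients in its first row and shifts the older samples down, $\Sigma_H(s)$ injects the innovation into the newest sample only, and $B$ reads out that sample. The helper `sar_hmm_to_ar_slds` is hypothetical.

```python
import numpy as np


def sar_hmm_to_ar_slds(a, sigma2):
    """Map SAR-HMM parameters to AR-SLDS parameters A(s), Sigma_H(s), B (for use with sigma_V = 0).

    a:      (S, R) AR coefficients per switch state, assumed convention
            v_t = sum_{r=1..R} a[s, r-1] * v_{t-r} + eta_t
    sigma2: (S,) innovation variance per switch state
    """
    S, R = a.shape
    A = np.zeros((S, R, R))
    A[:, 0, :] = a                  # first row predicts the new sample from the last R samples
    A[:, 1:, :-1] = np.eye(R - 1)   # sub-diagonal shifts older samples down by one slot
    Sigma_H = np.zeros((S, R, R))
    Sigma_H[:, 0, 0] = sigma2       # the innovation only enters the newest sample
    B = np.zeros(R)
    B[0] = 1.0                      # v_t = B h_t reads out the newest sample
    return A, Sigma_H, B


# Check: drive both models with the same switch path and innovations (illustrative values)
rng = np.random.default_rng(0)
a = np.array([[0.5, -0.2, 0.1], [1.2, -0.5, 0.05]])
sigma2 = np.array([0.1, 0.3])
A, Sigma_H, B = sar_hmm_to_ar_slds(a, sigma2)

T, R = 500, a.shape[1]
s = rng.integers(0, len(sigma2), size=T)            # arbitrary switch sequence
eta = rng.standard_normal(T) * np.sqrt(sigma2[s])   # innovations eta_t ~ N(0, sigma^2(s_t))

# SAR-HMM recursion with zero initial history
v_sar = np.zeros(T)
for t in range(T):
    past = np.array([v_sar[t - r] if t - r >= 0 else 0.0 for r in range(1, R + 1)])
    v_sar[t] = a[s[t]] @ past + eta[t]

# AR-SLDS recursion with sigma_V = 0, i.e. v_t = B h_t without observation noise
h = np.zeros(R)
v_slds = np.zeros(T)
for t in range(T):
    noise = np.zeros(R)
    noise[0] = eta[t]               # a draw from N(0, Sigma_H(s_t)) is nonzero only in slot 0
    h = A[s[t]] @ h + noise
    v_slds[t] = B @ h

print(np.allclose(v_sar, v_slds))  # True
```

Both recursions produce the same samples, which is what mimicking means here. A nonzero $\sigma_V$ would only add AWGN on top of $B h_t$ without touching $A(s)$, $B$ or $\Sigma_H(s)$, which is why no new model has to be trained.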

In practice, the AR-SLDS is particularly suited to cope with white noise disturbance, as the variable $\eta_t$ incorporates an AWGN model. It is, however, usually inferior to frame-level feature-based HMM approaches in clean conditions. This may be explained by the approach differing from human perception, which does not operate in the time domain. In coloured noise environments, the AR-SLDS usually also performs worse than frame-level feature modelling, such as with SLDMs. A limitation for practical use is the high computational requirement, even with the EC algorithm: for audio at 16 kHz, for example, $T$ is 160 times higher than for a feature vector sequence at 100 FPS, since $16\,000 / 100 = 160$.
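The factor of 160 follows directly from the two rates; the short Python calculation below makes it explicit for one second of signal. It also expresses the size of the $S^T$ switch-sequence space faced by exact inference as a base-10 logarithm to avoid overflow, using a hypothetical `S = 10` switch states; the helper `num_time_steps` is likewise only illustrative.

```python
import math


def num_time_steps(duration_s, rate_hz):
    """Number of time steps T for a signal of given duration (s) at a given rate (Hz or FPS)."""
    return round(duration_s * rate_hz)


T_samples = num_time_steps(1.0, 16_000)  # sample-level modelling (AR-SLDS) at 16 kHz
T_frames = num_time_steps(1.0, 100)      # frame-level feature vectors at 100 FPS
ratio = T_samples // T_frames
print(T_samples, T_frames, ratio)        # 16000 100 160

# Exact inference would have to consider S**T switch sequences; use log10 to avoid overflow.
S = 10  # hypothetical number of switch states
log10_exact_paths = T_samples * math.log10(S)
print(log10_exact_paths)                 # 16000.0
```

Because EC scales linearly in $T$, its cost still grows by roughly the same factor of 160 when moving from 100 FPS features to 16 kHz samples, whereas the $10^{16000}$ switch sequences of exact inference in this example are clearly intractable.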

Further model architectures that were not shown here are also well suited to cope with noise, in particular non-stationary noise. Examples are the LSTM networks shown in Sect. 7.2.3.4 [55, 56].
