Audio Source Separation - Intelligent Audio Analysis


8.3 NMF Activation Features

Let us now move on to describe how NMF can be used directly for audio recognition,

instead of performing signal pre-processing by audio source separation. The core idea

is to use supervised or semi-supervised NMF (cf. above), and then directly exploit the

matrix H for classification. In this case, NMF seeks a minimal-error representation of

the signal (in terms of the cost-function) with only a set of given spectra. As outlined

in Sect. 8.1, the H matrix measures the contribution of spectra to the original signal.

Thus, by using a matrix W that contains spectra of different target classes, the rows

of H indicate whether the original signal contains components of these

target classes. Furthermore, in this framework, additive noise can be modelled by

simply introducing additional NMF components corresponding to noise.
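
The idea of keeping W fixed and reading the class evidence from H can be sketched as follows. This is a minimal illustration, not the method of the cited works: it assumes a Euclidean cost with multiplicative updates (the text does not prescribe a cost function), and the function name `estimate_activations` as well as the toy matrices are hypothetical.

```python
import numpy as np

def estimate_activations(V, W, n_iter=200, eps=1e-12, seed=0):
    """Estimate H >= 0 such that V ~= W @ H, with W kept fixed.

    Assumption: Euclidean cost, multiplicative updates for H only.
    V: (bins x frames) non-negative spectrogram, W: (bins x components).
    """
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps  # positive initialisation
    WtV = W.T @ V                                   # constant over iterations
    WtW = W.T @ W
    for _ in range(n_iter):
        H *= WtV / (WtW @ H + eps)                  # update preserves non-negativity
    return H

# Toy data: 4 frequency bins; columns of W are class 1, class 2 and noise.
W = np.array([[1.0, 0.0, 0.2],
              [0.8, 0.1, 0.2],
              [0.0, 1.0, 0.2],
              [0.1, 0.9, 0.2]])
H_true = np.array([[1.0, 1.0, 0.0, 0.0],   # class 1 active in frames 0-1
                   [0.0, 0.0, 2.0, 2.0],   # class 2 active in frames 2-3
                   [0.1, 0.1, 0.1, 0.1]])  # constant additive noise
V = W @ H_true
H = estimate_activations(V, W)
```

In this toy case the rows of `H` for class 1 and class 2 dominate in the frames where the respective class was mixed in, while the third row absorbs the noise floor.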

For discrimination of C different audio signal classes $c \in \{1, \dots, C\}$, the matrix

W is built by column-wise concatenation:

$$ W := [\, W_1 \mid W_2 \mid \cdots \mid W_C \mid W_N \,], $$

where each $W_c$ contains 'characteristic' spectra of class c and the optional matrix

$W_N$ contains noise spectra. As in the source separation application, there are

a variety of methods for computing $W_c$ and $W_N$, such as base learning by NMF as in

the supervised speaker separation example above, or simply by randomly sampling

training spectrograms.
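
The simplest of these options, random sampling of training spectra followed by column-wise concatenation, might look like the sketch below. The helper names (`sample_spectra`, `build_dictionary`), the label convention (-1 for noise columns) and the random toy spectrograms are assumptions made for illustration only.

```python
import numpy as np

def sample_spectra(spectrogram, n_spectra, rng):
    """Pick n_spectra distinct random columns (frames) of a spectrogram."""
    idx = rng.choice(spectrogram.shape[1], size=n_spectra, replace=False)
    return spectrogram[:, idx]

def build_dictionary(class_spectrograms, noise_spectrogram=None,
                     n_spectra=5, seed=0):
    """Concatenate per-class blocks (and an optional noise block) column-wise.

    Returns W and an array giving the class index of every column of W;
    noise columns are labelled -1 (a convention chosen for this sketch).
    """
    rng = np.random.default_rng(seed)
    blocks, labels = [], []
    for c, S in enumerate(class_spectrograms):
        blocks.append(sample_spectra(S, n_spectra, rng))
        labels += [c] * n_spectra
    if noise_spectrogram is not None:
        blocks.append(sample_spectra(noise_spectrogram, n_spectra, rng))
        labels += [-1] * n_spectra
    return np.hstack(blocks), np.array(labels)

# Toy data: two classes plus noise, 6 frequency bins, 30 training frames each.
rng = np.random.default_rng(1)
class_spectrograms = [rng.random((6, 30)) for _ in range(2)]
noise_spectrogram = rng.random((6, 30))
W, labels = build_dictionary(class_spectrograms, noise_spectrogram)
```

Keeping the `labels` array alongside W makes it straightforward to tell later which rows of H belong to which class.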

Based on this, NMF activation features can be derived from H. Fig. 8.4 shows an

exemplary scheme for static audio classification based on NMF activations, which

delivered remarkable performance in discriminating linguistic and non-linguistic

vocalisations [15]. In this scheme, it is assumed that

base learning by NMF is used. An activation feature vector $a \in \mathbb{R}^R$, with one entry

per row of H, is calculated such that $a_i$ is the Euclidean length of the i-th row of H.

For independence of the length and power of the signal, a is normalised such that

$\|a\|_1 = 1$. The 'NMF activation

features' then are the components of the vector a . This vector can be passed on

to a suitable classifier, or the activations per class can be summed up to derive class

posteriors. In dynamic classification, for instance, the index of the most likely class per frame

can be used, as in [14, 21].
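
The feature computation described above can be sketched in a few lines. The function names and the toy activation matrix are hypothetical; excluding the noise components (label -1) when forming class scores, and renormalising those scores to sum to one, are assumptions of this sketch rather than details given in the text.

```python
import numpy as np

def activation_features(H, eps=1e-12):
    """a_i = Euclidean length of the i-th row of H, scaled so that sum(a) = 1."""
    a = np.linalg.norm(H, axis=1)
    return a / (a.sum() + eps)

def class_scores(a, labels, n_classes, eps=1e-12):
    """Sum the activation features per class and renormalise (noise excluded)."""
    scores = np.array([a[labels == c].sum() for c in range(n_classes)])
    return scores / (scores.sum() + eps)

def framewise_classes(H, labels, n_classes):
    """Dynamic variant: index of the class with the largest summed activation per frame."""
    per_class = np.stack([H[labels == c].sum(axis=0) for c in range(n_classes)])
    return per_class.argmax(axis=0)

# Toy activations: 4 components (two for class 0, one for class 1, one noise), 2 frames.
H = np.array([[3.0, 0.0],
              [0.0, 0.0],
              [0.0, 4.0],
              [1.0, 0.0]])
labels = np.array([0, 0, 1, -1])

a = activation_features(H)            # static feature vector, one entry per row of H
scores = class_scores(a, labels, 2)   # summed per class
frames = framewise_classes(H, labels, 2)
```

The vector `a` can be fed to any classifier as a fixed-length feature, whereas `frames` gives one decision per spectrogram frame.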

Let us now conclude the discussion of audio source separation and feature extraction

by NMF by showing an exemplary application to keyword recognition in highly

non-stationary noise [ 21 ]. This example is based on the CHiME (Computational

Hearing in Multisource Environments) challenge task of recognising command

words in a reverberated indoor domestic environment with multiple noise sources

and interfering speakers [ 22 ].

NMD bases are learnt for each of the 51 words in the vocabulary, and an additional

NMD noise base is computed from a set of noise samples in the training data. Speech

separation is performed in a procedure similar to the speaker separation example

above. Additionally, NMF activation features are computed using a base matrix W

assembled from spectrogram 'patches' in the training data, in a 'sliding window

NMF' framework (cf. above) with T = 20. As each speech spectrogram patch is
