Digital Signal Processing Reference
8.3 NMF Activation Features
Let us now move on to describe how NMF can be used directly for audio recognition,
instead of performing signal pre-processing by audio source separation. The core idea
is to use supervised or semi-supervised NMF (cf. above), and then directly exploit the
matrix H for classification. In this case, NMF seeks a minimal-error representation of
the signal (in terms of the cost-function) with only a set of given spectra. As outlined
in Sect. 8.1, the H matrix measures the contribution of each spectrum to the original signal.
Thus, by using a matrix W that contains spectra of different target classes, the rows
of H indicate whether the original signal contains components of these
target classes. Furthermore, in this framework, additive noise can be modelled by
simply introducing additional NMF components corresponding to noise.
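The idea of keeping W fixed and estimating only H can be sketched as follows. This is a minimal illustration, not the reference implementation: the function name `nmf_activations`, the toy matrices, and the choice of the Euclidean cost with its standard multiplicative update are assumptions made for this example (the text leaves the cost function open).

```python
import numpy as np

def nmf_activations(V, W, n_iter=200, eps=1e-12, seed=0):
    """Estimate H >= 0 with V ~= W @ H while keeping W fixed (supervised NMF).

    Assumption of this sketch: Euclidean cost ||V - W H||_F^2, minimised with
    the standard multiplicative update for H.
    """
    rng = np.random.default_rng(seed)
    # Strictly positive random initialisation, one row of H per column of W.
    H = rng.random((W.shape[1], V.shape[1])) + eps
    WtV = W.T @ V
    WtW = W.T @ W
    for _ in range(n_iter):
        # Multiplicative update keeps every entry of H non-negative.
        H *= WtV / (WtW @ H + eps)
    return H

# Toy data (hypothetical): 4 frequency bins, 3 given spectra, 3 frames.
W = np.array([[1.0, 0.0, 0.0],
              [0.8, 0.1, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.2, 1.0]])
H_true = np.array([[1.0, 0.2, 0.5],
                   [0.3, 1.0, 0.5],
                   [0.1, 0.1, 0.4]])
V = W @ H_true                      # observed "spectrogram"

H = nmf_activations(V, W)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(H.round(3))
print("relative reconstruction error:", rel_err)
```

Each row of the resulting H tracks how strongly the corresponding column of W contributes over time, which is the information the activation features are built on.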
For discrimination of C different audio signal classes $c \in \{1, \ldots, C\}$, the matrix
W is built by column-wise concatenation:
$$W := [\, W_1 \mid W_2 \mid \cdots \mid W_C \mid W_N \,],$$
where each $W_c$ contains 'characteristic' spectra of class c and the optional matrix
$W_N$ contains noise spectra. Similarly to the source separation application, there are
a variety of methods for computing $W_c$ and $W_N$, such as base learning by NMF as in
the supervised speaker separation example above, or simply by randomly sampling
training spectrograms.
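As a rough sketch of the simplest variant mentioned above, the following assembles W by randomly sampling frames from training spectrograms and concatenating the per-class blocks column-wise. The helper names `sample_spectra` and `build_dictionary`, the label convention (-1 for noise components), and the random toy spectrograms are hypothetical and only serve the illustration.

```python
import numpy as np

def sample_spectra(spectrogram, n_spectra, rng):
    """Pick n_spectra distinct random columns (frames) of a magnitude spectrogram."""
    idx = rng.choice(spectrogram.shape[1], size=n_spectra, replace=False)
    return spectrogram[:, idx]

def build_dictionary(class_spectrograms, noise_spectrogram=None,
                     n_spectra=10, seed=0):
    """Return W = [W_1 | ... | W_C | W_N] and the class label of each column.

    Label convention (an assumption of this sketch): 0..C-1 for the target
    classes, -1 for the optional noise components.
    """
    rng = np.random.default_rng(seed)
    blocks = [sample_spectra(S, n_spectra, rng) for S in class_spectrograms]
    labels = [c for c in range(len(blocks)) for _ in range(n_spectra)]
    if noise_spectrogram is not None:
        blocks.append(sample_spectra(noise_spectrogram, n_spectra, rng))
        labels += [-1] * n_spectra
    return np.hstack(blocks), np.array(labels)

# Toy training data (hypothetical): 2 classes plus noise, 64 frequency bins.
data_rng = np.random.default_rng(1)
train = [data_rng.random((64, 200)), data_rng.random((64, 150))]
noise = data_rng.random((64, 300))

W, labels = build_dictionary(train, noise, n_spectra=10)
print(W.shape)   # (64, 30)
```

Keeping the label vector alongside W makes it straightforward to map rows of H back to their class later on.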
Based on this, NMF activation features can be derived from H. Fig. 8.4 shows an
exemplary scheme for static audio classification based on NMF activations, which
delivered remarkable performance in discriminating linguistic and non-linguistic
vocalisations [15]. In this scheme, it is assumed that base learning by NMF is used.
An activation feature vector $a \in \mathbb{R}^R$, with one entry per row of H, is calculated such
that $a_i$ is the Euclidean length of the i-th row of H. For independence of the length
and power of the signal, a is normalised such that $\|a\|_1 = 1$. The 'NMF activation
features' are then the components of the vector a. This vector can be passed on
to a suitable classifier, or the activations per class can be summed up to derive class
posteriors. In dynamic classification, e.g., the index of the most likely class per frame
can be used as in [14, 21].
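The feature computation described above can be sketched in a few lines. The function names, the label vector, and the toy matrix are hypothetical; in particular, the frame-wise decision shown here (summing activations per class in each frame and taking the maximum) is only one plausible reading and is not claimed to be the exact procedure of [14, 21].

```python
import numpy as np

def activation_features(H, eps=1e-12):
    """a_i = Euclidean length of the i-th row of H, scaled so that sum(a) = 1."""
    a = np.linalg.norm(H, axis=1)
    return a / (a.sum() + eps)

def class_scores(a, labels, n_classes):
    """Sum the normalised activations of all components belonging to each class."""
    return np.array([a[labels == c].sum() for c in range(n_classes)])

def framewise_class(H, labels, n_classes):
    """Index of the class with the largest summed activation in each frame."""
    per_class = np.stack([H[labels == c].sum(axis=0) for c in range(n_classes)])
    return per_class.argmax(axis=0)

# Toy activations (hypothetical): 4 components, 2 frames, 2 classes.
H = np.array([[3.0, 0.0],
              [0.0, 0.0],
              [0.0, 4.0],
              [0.0, 1.0]])
labels = np.array([0, 0, 1, 1])      # class of each row of H

a = activation_features(H)           # approx. [0.375, 0, 0.5, 0.125]
scores = class_scores(a, labels, 2)  # approx. [0.375, 0.625]
frames = framewise_class(H, labels, 2)
print(a, scores, frames)
```

In the static case, `a` (or the per-class sums) would be handed to a classifier; in the dynamic case, the per-frame indices form a label sequence over time.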
Let us now conclude the discussion of audio source separation and feature extraction
by NMF by showing an exemplary application to keyword recognition in highly
non-stationary noise [ 21 ]. This example is based on the CHiME (Computational
Hearing in Multisource Environments) challenge task of recognising command
words in a reverberated indoor domestic environment with multiple noise sources
and interfering speakers [ 22 ].
NMD bases are learnt for each of the 51 words in the vocabulary, and an additional
NMD noise base is computed from a set of noise samples in the training data. Speech
separation is performed in a procedure similar to the speaker separation example
above. Additionally, NMF activation features are computed using a base matrix W
assembled from spectrogram 'patches' in the training data, in a 'sliding window
NMF' framework (cf. above) with T = 20. As each speech spectrogram patch is
 
 