Audio Source Separation - Intelligent Audio Analysis - page 139

Fig. 8.3 Benchmark results for monaural speaker separation by supervised NMF, in terms of real-time factor (RTF) and signal-to-distortion ratio (SDR) [20]. Mixed signals are formed from pairs of male/female speakers (24 speakers in total) from the TIMIT database. The open-source openBliSSART toolkit is used, and computation is performed on a consumer-grade GPU (NVIDIA GeForce GTX 560). The number of NMF iterations (20-320), the DFT window size (16, 64, 256 ms) and the NMF cost function are varied. (a) Euclidean distance. (b) KL divergence. (c) Itakura-Saito divergence.

randomly selected sentences of roughly equal length were mixed, and an NMF basis W was computed from the other sentences spoken by each speaker. To this end, unsupervised NMF (250 iterations) was applied to the concatenated spectrograms of these sentences, and only the first factor was kept. Separated signals for both speakers were obtained by supervised NMF with W, by summing up the component spectra corresponding to either speaker and applying the inverse STFT as discussed above. Computations were run on a 2.4 GHz desktop PC with 4 GB of RAM, using a consumer-grade GPU (NVIDIA GeForce GTX 560) with 336 CUDA cores. The NMF implementation from the open-source toolkit openBliSSART [12] is used.
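
The supervised separation step described above can be illustrated with a minimal NumPy sketch. This is not the openBliSSART implementation and it runs on the CPU; it assumes magnitude spectrograms as input, uses the standard multiplicative update for the KL divergence with the basis held fixed, and omits the inverse STFT. The function names, the `eps` guard and the random initialization are assumptions made for illustration.

```python
import numpy as np


def supervised_nmf_kl(V, W, n_iter=80, eps=1e-12, seed=0):
    """Estimate activations H >= 0 with V ~= W @ H while keeping W fixed.

    V: non-negative spectrogram, shape (n_bins, n_frames)
    W: pre-trained non-negative basis, shape (n_bins, n_components)
    Uses the multiplicative update rule for the generalized KL divergence.
    """
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps  # random positive start
    col_sums = W.sum(axis=0)[:, None] + eps         # W^T 1, shape (R, 1)
    for _ in range(n_iter):
        WH = W @ H + eps                            # current approximation
        H *= (W.T @ (V / WH)) / col_sums            # update H only
    return H


def separate_two_speakers(V, W1, W2, n_iter=80):
    """Return spectrogram estimates for both speakers.

    W1 and W2 are the per-speaker bases; their columns are stacked into one
    basis, and the component spectra of each speaker are summed afterwards.
    """
    W = np.hstack([W1, W2])
    H = supervised_nmf_kl(V, W, n_iter=n_iter)
    r1 = W1.shape[1]
    S1 = W1 @ H[:r1]   # sum of components belonging to speaker 1
    S2 = W2 @ H[r1:]   # sum of components belonging to speaker 2
    return S1, S2
```

Because only H is updated, the cost per iteration is dominated by two matrix products, which is consistent with the roughly linear growth of run time with the iteration count.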

RTFs are computed as the elapsed GPU time divided by the length of the mixed signals.
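
As a small worked example of this ratio, the following sketch computes an RTF from a processing time and a signal length in samples. The function name and parameters are hypothetical; the default sample rate of 16 kHz is the one assumed in the text.

```python
def real_time_factor(processing_seconds, num_samples, sample_rate=16000):
    """Processing time divided by signal duration.

    Values below 1 mean the signal was processed faster than real time.
    """
    audio_seconds = num_samples / sample_rate
    if audio_seconds <= 0:
        raise ValueError("signal must have positive duration")
    return processing_seconds / audio_seconds


# A 10 s mixture (160000 samples at 16 kHz) separated in 2 s of GPU time.
rtf = real_time_factor(2.0, 160000)
```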

The number of separation iterations was chosen from {20, 40, 80, 160, 320} because the convergence of multiplicative update NMF algorithms saturates quickly in audio source separation [9]. The different DFT window sizes considered are powers of two, ranging from 2^6 to 2^12, or 8-256 ms assuming a 16 kHz sample rate. From Fig. 8.3, it can be seen that the best average results are obtained by using the KL divergence as cost function. The Euclidean distance allows faster separation at the expense of quality, but here, reasonable results are only achieved for long window sizes (256 ms), which limits the practical applicability in contexts where real-time operation is required. Finally, the IS divergence enables robust separation, but is inferior to the KL divergence in terms of both separation quality and RTF. Generally, it can be observed that when the sources are modeled inadequately (indicated by overall low SDR), more iterations do not necessarily improve separation quality, even though computational cost grows linearly with the number of iterations; in fact, more iterations sometimes degrade quality, e.g., for the Euclidean cost function with a 16 or 64 ms window size.
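
For reference, the three cost functions compared in Fig. 8.3 can be written down in a few lines of NumPy. This is a sketch using the usual textbook definitions, not code taken from openBliSSART; the squared form of the Euclidean distance (without a factor of 1/2) and the small `eps` guard against division by zero are assumptions, and conventions differ between toolkits.

```python
import numpy as np


def euclidean_distance(V, X):
    """Squared Euclidean (Frobenius) distance between V and its model X."""
    return float(np.sum((V - X) ** 2))


def kl_divergence(V, X, eps=1e-12):
    """Generalized Kullback-Leibler divergence D(V || X)."""
    V = V + eps
    X = X + eps
    return float(np.sum(V * np.log(V / X) - V + X))


def is_divergence(V, X, eps=1e-12):
    """Itakura-Saito divergence D(V || X); invariant to a common scaling."""
    R = (V + eps) / (X + eps)
    return float(np.sum(R - np.log(R) - 1.0))
```

All three are zero when the model X equals V. The IS divergence depends only on the ratio V / X, so it weights low-energy and high-energy spectrogram bins alike, whereas the Euclidean distance is dominated by high-energy bins.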
