Digital Signal Processing Reference
Fig. 8.3 Benchmark results for monaural speaker separation by supervised NMF, in terms of real-time factor (RTF)
and signal-to-distortion ratio (SDR) [20]. Mixed signals from pairs of male/female speakers (24
speakers total) from the TIMIT database. The open-source openBliSSART toolkit is used, and
computation is performed on a consumer-grade GPU (NVIDIA GeForce GTX 560). The number
of NMF iterations (20-320), the DFT window size (16, 64, 256 ms) and the NMF cost function are
varied. (a) Euclidean distance. (b) KL divergence. (c) Itakura-Saito divergence.
randomly selected sentences of roughly equal length were mixed, and an NMF basis
W was computed from the other sentences spoken by each speaker. To this end,
unsupervised NMF (250 iterations) was applied to the concatenated spectrograms
of these sentences and only the first factor was kept. Separated signals for both
speakers were obtained by supervised NMF with W , by summing up component
spectra corresponding to either speaker, and applying inverse STFT as discussed
above. Computations were performed on a 2.4 GHz desktop PC with 4 GB of RAM, using a
consumer-grade GPU (NVIDIA GeForce GTX 560) with 336 CUDA cores. The
NMF implementation from the open-source toolkit openBliSSART [12] was used.
RTFs are computed as the elapsed GPU time divided by the length of the mixed signals.
The number of separation iterations was chosen from {20, 40, 80, 160, 320} because
the convergence of multiplicative-update NMF algorithms saturates quickly in
audio source separation [9]. The DFT window sizes considered are powers
of two, ranging from 2^6 to 2^12, or 8-256 ms assuming a 16 kHz sample rate. From
Fig. 8.3, it can be seen that the best average results are obtained by using the KL
divergence as the cost function. The Euclidean distance allows faster separation at the
expense of quality, but here, reasonable results are only achieved for long window
sizes (256 ms), which limits its practical applicability in contexts where real-time
operation is required. Finally, the IS divergence enables robust separation, but is
inferior to the KL divergence in terms of both separation quality and RTF. Generally,
it can be observed that when the sources are modeled inadequately (indicated by
overall low SDR), more iterations do not necessarily improve separation quality,
even though they increase computational complexity linearly; in fact, more
iterations sometimes degrade quality, e.g., for the Euclidean cost function and 16 or
64 ms window size.
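
The supervised separation step described above can be illustrated with a minimal NumPy sketch. It is not the openBliSSART implementation and runs on the CPU; all function names are hypothetical. It assumes that the fixed basis W is the column-wise concatenation of the per-speaker bases, and that the three cost functions are treated as members of the beta-divergence family (beta = 2 Euclidean, beta = 1 KL, beta = 0 Itakura-Saito), a common formulation that the text itself does not confirm. Only the activations H are updated, using the usual multiplicative rule. Two small helpers cover the arithmetic in the text: the RTF as elapsed time over signal length, and the window duration for a given number of samples. At 16 kHz, 2^12 = 4096 samples last 256 ms; by the same conversion, 2^7 = 128 samples give 8 ms and 2^6 = 64 samples give 4 ms.

```python
import numpy as np

EPS = 1e-12  # small constant to avoid division by zero and log(0)


def window_ms(n_samples, sample_rate_hz=16000):
    """Duration in milliseconds of a window of n_samples samples."""
    return 1000.0 * n_samples / sample_rate_hz


def real_time_factor(elapsed_seconds, signal_seconds):
    """RTF: processing time divided by the duration of the processed signal."""
    return elapsed_seconds / signal_seconds


def beta_divergence(V, V_hat, beta):
    """Cost summed over all bins: beta=2 Euclidean, beta=1 KL, beta=0 IS."""
    V = V + EPS
    V_hat = V_hat + EPS
    if beta == 2:
        return 0.5 * np.sum((V - V_hat) ** 2)
    if beta == 1:
        return np.sum(V * np.log(V / V_hat) - V + V_hat)
    if beta == 0:
        ratio = V / V_hat
        return np.sum(ratio - np.log(ratio) - 1.0)
    raise ValueError("beta must be 0, 1 or 2 in this sketch")


def supervised_nmf(V, W, n_iter=80, beta=1, seed=0):
    """Estimate activations H >= 0 with V ~ W @ H while W stays fixed."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + EPS  # random positive start
    for _ in range(n_iter):
        V_hat = W @ H + EPS
        # Multiplicative update for the beta-divergence; W is never touched.
        numerator = W.T @ (V * V_hat ** (beta - 2))
        denominator = W.T @ (V_hat ** (beta - 1)) + EPS
        H *= numerator / denominator
    return H


def speaker_spectrogram(W, H, component_idx):
    """Sum the component spectra that belong to one speaker."""
    return W[:, component_idx] @ H[component_idx, :]
```

In use, V would be the magnitude spectrogram of the mixture, and the index sets passed to `speaker_spectrogram` would select the columns of W trained on each speaker; the two resulting spectrograms sum to the full model W @ H, and each would then go through the inverse STFT. Because the cost per iteration is constant, running time grows linearly with `n_iter`, which matches the observation above that extra iterations add cost without guaranteeing better SDR.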
 
 