Real-Time Detection of Overlapping Sound Events with Non-Negative Matrix Factorization - Matrix Information Geometry


14.2.2 Applications to the Detection of Overlapping Sound Events

NMF algorithms have been applied to various problems in computer vision, signal
processing, biomedical data analysis and text classification, among others [25]. In the
context of sound processing, the matrix V is generally a time-frequency representation
of the sound to analyze, whose rows and columns represent different frequency bins and
successive time-frames, respectively. The factorization $\mathbf{v}_j \approx \sum_i h_{ij}\,\mathbf{w}_i$
can then be interpreted as follows: each basis vector $\mathbf{w}_i$ contains a spectral
template, and the decomposition coefficients $h_{ij}$ represent the activations of the
$i$-th template at the $j$-th time-frame.
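As an illustration of this interpretation, the following is a minimal sketch, assuming the Euclidean cost with the standard multiplicative updates and NumPy. The toy data and the names `nmf_euclidean`, `W_true` and `H_true` are hypothetical and do not come from the text.

```python
import numpy as np

def nmf_euclidean(V, r, n_iter=500, seed=0, eps=1e-12):
    """Factorize a non-negative matrix V (bins x frames) as W @ H.

    Columns of W play the role of spectral templates w_i, and H[i, j]
    is the activation h_ij of template i at time-frame j.
    Uses multiplicative updates for the Euclidean cost (an assumption
    made for this sketch).
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + eps
    H = rng.random((r, m)) + eps
    for _ in range(n_iter):
        # Each update keeps the factors non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy "spectrogram": 6 frequency bins, 5 time-frames, 2 overlapping events.
W_true = np.array([[1.0, 0.0],
                   [0.5, 0.0],
                   [0.0, 1.0],
                   [0.0, 0.8],
                   [0.2, 0.0],
                   [0.0, 0.1]])
H_true = np.array([[1.0, 1.0, 0.0, 0.0, 1.0],
                   [0.0, 1.0, 1.0, 0.0, 1.0]])
V = W_true @ H_true

W, H = nmf_euclidean(V, r=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print("relative reconstruction error:", rel_err)

# Column j of the reconstruction is the activation-weighted sum of templates.
j = 1
v_j = sum(H[i, j] * W[:, i] for i in range(W.shape[1]))
print("frame", j, "matches W @ H:", np.allclose(v_j, (W @ H)[:, j]))
```

The learned templates are only recovered up to a permutation and a scaling of the columns of W, so the activations should be read relative to the template they multiply.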

Concerning the detection of overlapping sound events, NMF has been widely used
in off-line systems for polyphonic music transcription, where the sound events
correspond roughly to notes (e.g., see [26, 27]). Several problem-dependent extensions
have been developed to provide control over NMF in this context, such as a
source-filter model [28], a harmonic constraint [29], a selective sparsity regularization [30],
or a subspace model of basis instruments [31]. Most of these systems consider either
the standard Euclidean cost or the Kullback-Leibler divergence. Recent works, however,
have investigated the use of other cost functions, such as the Itakura-Saito divergence
[32-35] or the more general parametric beta-divergence [17].
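To make the relation between these cost functions concrete, here is a small sketch of the beta-divergence in its common parameterization, which is an assumption since the formula is not stated in this passage. In that parameterization, beta = 2 gives half the squared Euclidean distance, beta = 1 the (generalized) Kullback-Leibler divergence, and beta = 0 the Itakura-Saito divergence. The function name `beta_divergence` is hypothetical, and strictly positive inputs are assumed.

```python
import numpy as np

def beta_divergence(x, y, beta):
    """Sum of element-wise beta-divergences d_beta(x | y).

    Assumes strictly positive x and y (so logs and ratios are defined).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if beta == 1:
        # Generalized Kullback-Leibler divergence (limit beta -> 1).
        return np.sum(x * np.log(x / y) - x + y)
    if beta == 0:
        # Itakura-Saito divergence (limit beta -> 0).
        return np.sum(x / y - np.log(x / y) - 1.0)
    num = x**beta + (beta - 1.0) * y**beta - beta * x * y**(beta - 1.0)
    return np.sum(num / (beta * (beta - 1.0)))

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.5, 1.0, 3.5])
for b in (2, 1, 0):
    print("beta =", b, "->", beta_divergence(x, y, b))
```

One property worth noting is that the Itakura-Saito case is scale-invariant: multiplying both arguments by the same positive constant leaves the divergence unchanged.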

Some authors have also used non-negative decomposition for sound event detection.
A real-time system to identify the presence and determine the pitch of one or
more voices is proposed in [4] and is adapted to sight-reading evaluation of solo
instruments in [5]. Concerning automatic transcription, off-line systems are used in
[6] for drum transcription and in [7] for polyphonic music transcription. A real-time
system for polyphonic music transcription is also proposed in [8] and is further
developed in [9] for real-time coupled multiple-pitch and multiple-instrument
recognition. All these systems consider either the Euclidean or the Kullback-Leibler cost
function, and only the latter provides control over the decomposition, by constraining
the solutions to have a fixed desired sparsity.
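The idea of non-negative decomposition can be sketched as follows: the templates are kept fixed, and only the activations of each incoming frame are estimated. This is a minimal sketch, assuming the Kullback-Leibler cost with its multiplicative update and no sparsity control; it is not the implementation of any of the cited systems. The names `decompose_frame` and `threshold`, and the toy templates, are hypothetical.

```python
import numpy as np

def decompose_frame(v, W, n_iter=200, eps=1e-12):
    """Estimate activations h >= 0 such that v is approximately W @ h.

    W (bins x templates) is fixed; only h is updated, using the
    multiplicative update for the Kullback-Leibler divergence.
    """
    h = np.ones(W.shape[1])
    col_sums = W.sum(axis=0) + eps
    for _ in range(n_iter):
        h *= (W.T @ (v / (W @ h + eps))) / col_sums
    return h

# Three fixed templates over 6 frequency bins (columns sum to 1).
W = np.array([[0.6, 0.0, 0.0],
              [0.3, 0.1, 0.0],
              [0.1, 0.3, 0.0],
              [0.0, 0.6, 0.2],
              [0.0, 0.0, 0.4],
              [0.0, 0.0, 0.4]])

# One incoming frame in which events 0 and 1 overlap and event 2 is absent.
v = 0.7 * W[:, 0] + 0.3 * W[:, 1]

h = decompose_frame(v, W)
threshold = 0.05            # hypothetical detection threshold
active = h > threshold      # which events are considered present
print("activations:", np.round(h, 3))
print("detected events:", np.flatnonzero(active))
```

Because each frame is processed independently with a fixed dictionary, the per-frame cost is a small number of matrix-vector products, which is what makes this kind of scheme attractive for real-time use.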

Other approaches in the framework of probabilistic models with latent variables
also share common perspectives with NMF techniques [36]. In this framework, the
non-negative data are considered as a discrete distribution and are factorized into a
mixture model where each latent component represents a source. It can then be shown
that maximum likelihood estimation of the mixture parameters amounts to NMF with
the Kullback-Leibler divergence, and that the classical expectation-maximization
algorithm is equivalent to the multiplicative updates scheme. Considering the problem
in a probabilistic framework is, however, convenient for enhancing the standard
model and adding regularization terms through priors and maximum a posteriori
estimation instead of maximum likelihood estimation. In particular, the framework
has been employed in polyphonic music transcription to include shift-invariance and
sparsity [37]. Recent works have extended the latter model to include a temporal
smoothing and a unimodal prior for the impulse distributions [38], a hierarchical
