potentialities of an integrated approach to music
description. In order to solve the third point, all
present approaches to score-to-audio synchronization
proceed in two stages: in the first stage,
suitable parameters are extracted from the score
and audio data streams making them comparable;
in the second stage, an optimal alignment is com-
puted by means of dynamic programming (DP)
based on a suitable local distance measure.
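The second stage can be illustrated by a minimal sketch of a textbook DP alignment (dynamic time warping) over two feature sequences. It is not the implementation of any of the approaches discussed here: the function name dtw, the step pattern (horizontal, vertical, diagonal) and the distance callback dist are assumptions chosen for illustration.

```python
from math import inf


def dtw(xs, ys, dist):
    """Align two non-empty sequences; return (total_cost, path).

    `dist` is any local distance measure on pairs of features.
    Illustrative sketch only, using the three basic step directions.
    """
    n, m = len(xs), len(ys)
    # acc[i][j]: minimal accumulated cost of aligning xs[:i+1] with ys[:j+1]
    acc = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            cost = dist(xs[i], ys[j])
            if i == 0 and j == 0:
                acc[i][j] = cost
                continue
            best = inf
            if i > 0:
                best = min(best, acc[i - 1][j])
            if j > 0:
                best = min(best, acc[i][j - 1])
            if i > 0 and j > 0:
                best = min(best, acc[i - 1][j - 1])
            acc[i][j] = cost + best

    # Backtrack from the last cell to the first to recover the alignment.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return acc[n - 1][m - 1], path


# Toy usage: scalar "features" and absolute difference as local distance.
total, path = dtw([1, 2, 3], [1, 2, 2, 3], lambda a, b: abs(a - b))
```

Note that the table acc has one cell per pair of feature vectors, so both memory and running time grow with the product of the two sequence lengths.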
Turetsky et al. [§7] first convert the score
data (given in MIDI format) into an audio data
stream using a synthesizer. Then, the two audio
data streams are analyzed by means of a short-time
Fourier transform (STFT), which in turn yields a
sequence of suitable feature vectors.
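As a rough illustration of how an STFT turns an audio stream into a sequence of feature vectors, the following sketch computes magnitude spectra frame by frame. It is an assumption-laden toy, not the feature extraction of Turetsky et al.: the function name stft_magnitudes, the Hann window, and the naive DFT (used here only to keep the example dependency-free) are all illustrative choices.

```python
import cmath
import math


def stft_magnitudes(signal, frame_len, hop):
    """Return one magnitude spectrum (list of floats) per analysis frame.

    Illustrative sketch: Hann-windowed frames and a naive O(N^2) DFT.
    A real system would use an FFT routine instead.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        # Only bins 0 .. frame_len/2 are needed for a real-valued signal.
        for k in range(frame_len // 2 + 1):
            coeff = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n in range(frame_len))
            spectrum.append(abs(coeff))
        features.append(spectrum)
    return features


# Toy usage: a sinusoid that completes exactly 4 cycles per 32-sample frame.
tone = [math.sin(2 * math.pi * 4 * n / 32) for n in range(128)]
feats = stft_magnitudes(tone, frame_len=32, hop=16)
peak_bin = max(range(len(feats[0])), key=lambda k: feats[0][k])
```

Each entry of feats is one feature vector; the bins are equally spaced in frequency, which is the property that limits the resolution for low notes.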
Based on an adequate local distance measure
permitting a pairwise comparison of these feature
vectors, the best alignment is derived by means
of dynamic time warping (DTW). The approach of
Soulez, Rodet, and Schwarz (2003) is similar to
that of Turetsky et al. [§7], with one fundamental
difference. In Turetsky et al. [§7], the score data
is first converted into the much more complex
audio format, so the explicit knowledge of note
parameters is not used in the actual
synchronization step. In contrast, Soulez et al.
(2003) explicitly use note parameters such as
onset times and pitches to generate a sequence of
attack, sustain, and silence models, which are
used in the synchronization process. This results
in an algorithm that is more robust with respect
to local time deviations and small spectral
variations.
Since the STFT is used for the analysis of the
audio data stream, both approaches have the
following drawbacks:
Firstly, the STFT computes spectral coefficients
that are spread linearly over the spectrum,
resulting in poor low-frequency resolution.
Therefore, one has to rely on the harmonics in the
case of low notes. This is problematic in
polyphonic music, where harmonics and fundamental
frequencies of different notes often coincide.
Secondly, in order to obtain a sufficient time
resolution, one has to work with a relatively
large number of feature vectors on the audio side.
(For example, even with a rough time resolution of
46 ms as suggested in Turetsky et al. [§7], more
than 20 feature vectors per second are required.)
This leads to huge memory requirements as well as
long running times in the DTW computation.
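Both drawbacks can be made concrete with a short back-of-the-envelope calculation. All figures below are assumptions for illustration and are not taken from the cited work: a sampling rate of 44100 Hz, a frame of 2048 samples (which corresponds to roughly 46 ms), non-overlapping frames, a five-minute recording, a synthesized score of the same length, and 8 bytes per entry of the DTW cost table.

```python
SAMPLE_RATE = 44100   # Hz (assumed)
FRAME_LEN = 2048      # samples per frame (assumed); about 46 ms at 44100 Hz

# Time resolution and resulting feature rate (non-overlapping frames assumed).
frame_seconds = FRAME_LEN / SAMPLE_RATE
vectors_per_second = 1 / frame_seconds

# Frequency resolution: STFT bins are equally spaced in Hz.
bin_spacing_hz = SAMPLE_RATE / FRAME_LEN


def midi_to_hz(note):
    """Equal-tempered pitch of a MIDI note number (A4 = 69 = 440 Hz)."""
    return 440.0 * 2 ** ((note - 69) / 12)


# Distance between the two lowest piano notes, A0 and A#0.
semitone_gap_low = midi_to_hz(22) - midi_to_hz(21)

# Size of a full DTW cost table for two streams of equal length (assumed).
duration_seconds = 300
n_frames = int(duration_seconds * vectors_per_second)
table_cells = n_frames * n_frames
table_megabytes = table_cells * 8 / 1e6   # 8 bytes per cell (assumed)
```

Under these assumptions the bins are about 21.5 Hz apart while A0 and A#0 differ by only about 1.6 Hz, so neighbouring low notes fall into the same bin. The full cost table for the five-minute example has roughly 42 million cells, or about a third of a gigabyte.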
In the approach of Arifi (2004), note parameters
such as onset times and pitches are extracted from
the audio data stream (piano music). The alignment
process is then performed in the score-like domain
by means of a suitably designed cost measure on
the note level. Due to the expressiveness of such
note parameters only a small number of features
is sufficient to solve the synchronization task, al-
lowing for a more efficient alignment. One major
drawback of this approach is that the extraction
of score-like note parameters from the audio
data—a kind of music transcription—constitutes
a difficult and time-consuming problem, pos-
sibly leading to many wrongly extracted audio
features. This makes the subsequent alignment
step a delicate task.
Müller, Kurth, and Röder (2004) present an
algorithm that solves the synchronization
problem accurately and efficiently for complex,
polyphonic piano music. In a first step, they extract
from the audio data stream a set of highly expres-
sive features encoding note onset candidates
separately for all pitches. This makes computa-
tions efficient since only a small number of such
features are sufficient to solve the synchroniza-
tion task. Based on a suitable matching model,
the best match between the score and the feature
parameters is computed by dynamic programming
(DP). To further cut down the computational cost
of the synchronization process, they introduce
the concept of anchor matches, that is, matches
that can be easily established. The DP-based
technique is then applied locally between adjacent
anchor matches.
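The saving obtained from anchor matches can be sketched as follows. The example only shows how a set of given anchors splits the alignment into independent sub-problems and how many DP cells remain. How anchors are detected is not described here, so the anchor list is simply assumed as input, and the names segments_between_anchors and dp_cells are hypothetical.

```python
def segments_between_anchors(n_score, n_audio, anchors):
    """Split an alignment problem at anchor matches.

    `anchors` is a list of (score_index, audio_index) pairs assumed to be
    known in advance. Returns half-open index ranges
    ((score_lo, score_hi), (audio_lo, audio_hi)) for each region between
    adjacent anchors, including the regions before the first and after
    the last anchor.
    """
    points = [(-1, -1)] + sorted(anchors) + [(n_score, n_audio)]
    segments = []
    for (s0, a0), (s1, a1) in zip(points, points[1:]):
        if s1 <= s0 or a1 <= a0:
            raise ValueError("anchors must increase strictly in both sequences")
        segments.append(((s0 + 1, s1), (a0 + 1, a1)))
    return segments


def dp_cells(segments):
    """Number of DP table cells needed if each segment is aligned separately."""
    return sum((s_hi - s_lo) * (a_hi - a_lo)
               for (s_lo, s_hi), (a_lo, a_hi) in segments)


# Toy usage: 100 score events, 400 audio features, three assumed anchors.
anchored = segments_between_anchors(100, 400, [(24, 99), (49, 199), (74, 299)])
unanchored = segments_between_anchors(100, 400, [])
cells_with_anchors = dp_cells(anchored)
cells_without_anchors = dp_cells(unanchored)
```

With evenly spaced anchors the number of DP cells shrinks roughly in proportion to the number of segments, since each local table covers only a fraction of both sequences.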