instrumental segments in frequency sub-bands
above 1 kHz, with the largest difference in the
highest frequencies, where vocal segments have
over 10 dB greater signal power than nonvocal
segments. This feature, along with a measure of rhythmic tempo and overall signal loudness, is used in conjunction with a hidden Markov model to identify vocal segments.
This systematic spectral difference between
audio segments with and without singing voice
is a promising detection feature because it does
not impose specific constraints on the structure of
the vocal part (e.g., pitch range, formant structure,
bandwidth) and appears to be robust to interference from different types of instrumental signals. However, such a simple feature hardly
seems adequate to truly identify whether a hu-
man voice is present in the audio, and most likely
reflects a difference between audio segments that
do and do not contain their lead musical part
(whether that part is a voice or a saxophone). Most
likely, the systematic increase in high-frequency
energy in the segments that contain a vocal part
is due to either conscious or subconscious deci-
sions by the recording and mixing engineers to
make the leading musical part (the voice in this
case) the brightest and most audible component
of the recording. For MusicStory, the goal is to
find significant structural events in the music to
help guide video creation. We use the systematic
spectral difference cue to find the places where a
lead harmonic instrument (typically the singer,
in a song with lyrics) is introduced, letting us
structure the video in a way that responds to the
introduction and removal of musically important
elements in the audio. While the current system relies on the systematic spectral difference between vocal and nonvocal segments in pop songs, future versions of MusicStory may use more in-depth methods to find musically salient points at which to change the video presentation style.
To explore the validity of this systematic difference between vocal and nonvocal audio segments, we conducted an informal study of 155 three- to six-second audio segments drawn from 25 rock and popular music recordings. We calculated log-frequency power coefficients (LFPCs) for each segment and compared the average LFPCs over all segments containing a vocal part (75 segments) to the average LFPCs over all instrumental segments (80 segments). Table 2 lists the songs from which these short segments were extracted.
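As an informal illustration of this comparison (a sketch, not the authors' exact procedure), the fragment below averages per-segment LFPC vectors over the two labelled groups and inspects the per-band difference. The lfpc() helper is hypothetical and stands in for the computation outlined in Equations (2) and (3) below; the segment lists are likewise assumed.

```python
import numpy as np

def group_mean_lfpc(segments, sr, lfpc):
    """Average the LFPC vector over a list of audio segments (1-D arrays)."""
    return np.mean([lfpc(seg, sr) for seg in segments], axis=0)

# Hypothetical usage: vocal_segs (75 segments) and instrumental_segs (80
# segments) would be lists of hand-labelled audio arrays.
#   diff = group_mean_lfpc(vocal_segs, sr, lfpc) \
#        - group_mean_lfpc(instrumental_segs, sr, lfpc)
# Positive values in the upper sub-bands of diff correspond to the greater
# high-frequency power of vocal segments described at the start of this section.
```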
LFPCs function as a low-dimensional representation of the magnitude of each segment's frequency spectrum and capture the general spectral shape of the audio signal. To calculate LFPCs, the short-time Fourier transform (STFT) of each segment is taken using Equation (2). Here, W represents a 93 ms Hanning window, τ is the time at the center of the analysis window, and ω is the frequency of analysis.
X(\tau, \omega) = \frac{1}{2\pi} \int W(t - \tau)\, x(t)\, e^{-i\omega t}\, dt \qquad (2)
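As a rough sketch of this step (our own code, not the MusicStory implementation), the fragment below computes the magnitude STFT of one segment with a 93 ms Hanning window using scipy; the 50% window overlap is an assumption, since the hop size is not stated here.

```python
import numpy as np
from scipy.signal import stft

def segment_stft(x, sr):
    """Magnitude STFT |X(tau, omega)| of one audio segment x sampled at sr Hz."""
    win_len = int(0.093 * sr)              # 93 ms Hanning window
    freqs, times, X = stft(x, fs=sr, window="hann",
                           nperseg=win_len, noverlap=win_len // 2)
    return freqs, times, np.abs(X)         # bin frequencies (Hz), frame times (s), |X|
```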
The Fourier transform captures detailed infor-
mation about how the signal energy is distributed
as a function of frequency and time. In order to
create a more general characterization of the fre-
quency content over the entire audio segment, we
take the mean over all time steps and divide the
signal into 13 frequency sub-bands using Equa-
tion (3). Frequency sub-bands are spaced between
130 Hz and 16 kHz according to the equivalent
rectangular bandwidth (ERB) scale (Moore &
Glasberg, 1983). In Equation (3), k represents the index of the frequency sub-band, N is the number of time steps in X(τ, ω), f_k is the center frequency of the sub-band in Hz, and b_k is the sub-band bandwidth in Hz.
S(k) = \frac{1}{N b_k} \sum_{\tau} \sum_{\omega = f_k - b_k/2}^{f_k + b_k/2} |X(\tau, \omega)| \qquad (3)
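Below is a minimal sketch of Equation (3), under the assumption that the 13 band edges are spaced evenly on the ERB-number scale using the common Glasberg and Moore approximation ERB(f) = 21.4 log10(1 + 0.00437 f); the exact spacing used by the original system may differ.

```python
import numpy as np

def erb_band_edges(f_lo=130.0, f_hi=16000.0, n_bands=13):
    """14 band edges in Hz, evenly spaced on the ERB-number scale (an assumption)."""
    to_erb = lambda f: 21.4 * np.log10(1.0 + 0.00437 * f)
    from_erb = lambda e: (10.0 ** (e / 21.4) - 1.0) / 0.00437
    return from_erb(np.linspace(to_erb(f_lo), to_erb(f_hi), n_bands + 1))

def subband_levels(freqs, mag, edges):
    """S(k): STFT magnitude summed over time and over each sub-band, divided
    by the number of frames N and the bandwidth b_k in Hz, as in Equation (3)."""
    n_frames = mag.shape[1]
    S = np.zeros(len(edges) - 1)
    for k in range(len(edges) - 1):
        in_band = (freqs >= edges[k]) & (freqs < edges[k + 1])
        b_k = edges[k + 1] - edges[k]
        S[k] = mag[in_band, :].sum() / (n_frames * b_k)
    return S
```

Feeding the output of segment_stft() above into subband_levels() yields the 13 values S(k) for one segment.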
We then calculate the log-frequency power coefficients using Equation (4), where M denotes the number of
 