instrumental segments in frequency sub-bands
above 1 kHz, with the largest difference in the
highest frequencies, where vocal segments have
over 10 dB greater signal power than nonvocal
segments. This feature, along with a measure of rhythmic tempo and overall signal loudness, is used in conjunction with a hidden Markov model to identify vocal segments.
This systematic spectral difference between
audio segments with and without singing voice
is a promising detection feature because it does
not impose specific constraints on the structure of
the vocal part (e.g., pitch range, formant structure,
bandwidth) and appears to be robust to interference from different types of instrumental signals. However, such a simple feature hardly
seems adequate to truly identify whether a hu-
man voice is present in the audio, and most likely
reflects a difference between audio segments that
do and do not contain their lead musical part
(whether that part is a voice or a saxophone). Most
likely, the systematic increase in high-frequency
energy in the segments that contain a vocal part
is due to either conscious or subconscious deci-
sions by the recording and mixing engineers to
make the leading musical part (the voice in this
case) the brightest and most audible component
of the recording. For MusicStory, the goal is to
find significant structural events in the music to
help guide video creation. We use the systematic
spectral difference cue to find the places where a
lead harmonic instrument (typically the singer,
in a song with lyrics) is introduced, letting us
structure the video in a way that responds to the
introduction and removal of musically important
elements in the audio. While the current system relies on the systematic spectral difference between vocal and nonvocal segments in pop songs, future versions of MusicStory may use more in-depth methods to find musically salient points at which to change the video presentation style.
To explore the validity of this systematic difference between vocal and nonvocal audio segments, we conducted an informal study of 155 three- to six-second audio segments drawn from 25 rock and popular music recordings. We calculated log-frequency power coefficients (LFPCs) for each segment and compared the average LFPCs over all segments containing a vocal part (75 segments) to the average LFPCs over all instrumental segments (80 segments). Table 2 lists the songs from which these short segments were extracted.
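As an informal illustration of this comparison (a sketch, not the authors' exact procedure), the fragment below averages per-segment LFPC vectors over the two labelled groups and inspects the per-band difference. The lfpc() helper is hypothetical and stands in for the computation outlined in Equations (2) and (3) below; the segment lists are likewise assumed.

```python
import numpy as np

def group_mean_lfpc(segments, sr, lfpc):
    """Average the LFPC vector over a list of audio segments (1-D arrays)."""
    return np.mean([lfpc(seg, sr) for seg in segments], axis=0)

# Hypothetical usage: vocal_segs (75 segments) and instrumental_segs (80
# segments) would be lists of hand-labelled audio arrays.
#   diff = group_mean_lfpc(vocal_segs, sr, lfpc) \
#        - group_mean_lfpc(instrumental_segs, sr, lfpc)
# Positive values in the upper sub-bands of diff correspond to the greater
# high-frequency power of vocal segments described at the start of this section.
```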
LFPCs function as a low-dimensional representation of the magnitude of each segment's frequency spectrum and capture the general spectral shape of the audio signal. To calculate LFPCs, the short-time Fourier transform (STFT) of each segment is taken using Equation (2). Here, W represents a 93 ms Hanning window, τ is the time at the center of the analysis window, and ω is the frequency of analysis.
X(\tau, \omega) = \frac{1}{2\pi} \int W(t - \tau)\, x(t)\, e^{-i\omega t}\, dt \qquad (2)
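As a rough sketch of this step (our own code, not the MusicStory implementation), the fragment below computes the magnitude STFT of one segment with a 93 ms Hanning window using scipy; the 50% window overlap is an assumption, since the hop size is not stated here.

```python
import numpy as np
from scipy.signal import stft

def segment_stft(x, sr):
    """Magnitude STFT |X(tau, omega)| of one audio segment x sampled at sr Hz."""
    win_len = int(0.093 * sr)              # 93 ms Hanning window
    freqs, times, X = stft(x, fs=sr, window="hann",
                           nperseg=win_len, noverlap=win_len // 2)
    return freqs, times, np.abs(X)         # bin frequencies (Hz), frame times (s), |X|
```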
The Fourier transform captures detailed infor-
mation about how the signal energy is distributed
as a function of frequency and time. In order to
create a more general characterization of the fre-
quency content over the entire audio segment, we
take the mean over all time steps and divide the
signal into 13 frequency sub-bands using Equa-
tion (3). Frequency sub-bands are spaced between
130 Hz and 16 kHz according to the equivalent
rectangular bandwidth (ERB) scale (Moore &
Glasberg, 1983). In Equation (3), k represents the index of the frequency sub-band, N is the number of time steps in X(τ, ω), f_k is the center frequency of the sub-band in Hz, and b_k is the sub-band bandwidth in Hz.
S(k) = \frac{1}{N b_k} \sum_{\tau} \sum_{\omega = f_k - b_k/2}^{f_k + b_k/2} |X(\tau, \omega)| \qquad (3)
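Below is a minimal sketch of Equation (3), under the assumption that the 13 band edges are spaced evenly on the ERB-number scale using the common Glasberg and Moore approximation ERB(f) = 21.4 log10(1 + 0.00437 f); the exact spacing used by the original system may differ.

```python
import numpy as np

def erb_band_edges(f_lo=130.0, f_hi=16000.0, n_bands=13):
    """14 band edges in Hz, evenly spaced on the ERB-number scale (an assumption)."""
    to_erb = lambda f: 21.4 * np.log10(1.0 + 0.00437 * f)
    from_erb = lambda e: (10.0 ** (e / 21.4) - 1.0) / 0.00437
    return from_erb(np.linspace(to_erb(f_lo), to_erb(f_hi), n_bands + 1))

def subband_levels(freqs, mag, edges):
    """S(k): STFT magnitude summed over time and over each sub-band, divided
    by the number of frames N and the bandwidth b_k in Hz, as in Equation (3)."""
    n_frames = mag.shape[1]
    S = np.zeros(len(edges) - 1)
    for k in range(len(edges) - 1):
        in_band = (freqs >= edges[k]) & (freqs < edges[k + 1])
        b_k = edges[k + 1] - edges[k]
        S[k] = mag[in_band, :].sum() / (n_frames * b_k)
    return S
```

Feeding the output of segment_stft() above into subband_levels() yields the 13 values S(k) for one segment.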
We then calculate the log-frequency power coefficients using Equation (4), where M denotes the number of
 