model can be put to use by the agent to optimize
the display and interaction.
In our current work, pacing is affected by song
tempo and the agent's artistic intent. The direct-
ing agent adjusts how long an image is displayed
and the speed of transition between images, in-
fluenced by the placement of peaks in the volume of
the audio. The final slide show pacing does not
map strictly to beat-by-beat image transitions,
but visually moves at a speed complementary to
a multiple of the beat. To make this adjustment,
the agent first needs to know the general pace of
the song. For many kinds of popular music, good
synchronization points for the video correspond
to peaks in the root-mean-square (RMS) amplitude
of the audio signal. To use RMS for this
application, we must look at the structure of the
digitally encoded song.
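For illustration, the sketch below shows one way a directing agent might snap an image's display time to a whole multiple of the song's pulse once that pace is known; the function name and the 2 to 8 second duration range are our own assumptions, not details taken from MusicStory.

def choose_display_duration(pulse_period_s, min_s=2.0, max_s=8.0):
    """Snap a slide's display time to a whole multiple of the song's
    pulse period, so transitions land near volume peaks.

    The 2-8 second range is an assumed artistic constraint, not a
    value taken from MusicStory.
    """
    multiple = 1
    while multiple * pulse_period_s < min_s:
        multiple += 1
    return min(multiple * pulse_period_s, max_s)

# A ~1.025 s pulse (see Figure 4 below) yields a 2.05 s display time.
print(choose_display_duration(1.025))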
Finding the beat in music is often problematic
(Pardo, 2004), especially in cases of odd meter
and shifting metric levels. It is, however, simple to
find the average pace at which percussive events
occur in a passage of music. For the purposes of
video pacing, this turns out to be an important
and useful measure.
To find the event pace, we compute the RMS amplitude of the audio at time t by applying basic RMS to a 100 millisecond window centered on time t:

\mathrm{RMS}(t) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2} \qquad (1)

where n is the number of samples in the window and x_i is the amplitude of a single sample. For simplicity, a compressed audio file is converted to linearly encoded PCM audio before the RMS amplitude is calculated.
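A minimal sketch of Equation (1) in Python follows, assuming the song has already been decoded to a single channel of linear PCM sample amplitudes as described above:

import math

def rms(samples, t, sample_rate, window_s=0.1):
    """Equation (1): RMS amplitude of a 100 ms window centered on time t.

    'samples' is a sequence of linear PCM amplitudes (one channel),
    obtained by first decoding the compressed audio file.
    """
    half = int(window_s * sample_rate / 2)
    center = int(t * sample_rate)
    window = samples[max(0, center - half):center + half]
    n = len(window)
    return math.sqrt(sum(x * x for x in window) / n) if n else 0.0

# Sample the RMS curve every 10 ms across the first 10 seconds:
# rms_curve = [rms(samples, k * 0.01, 44100) for k in range(1000)]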
Figure 4 shows the RMS amplitude for the first 10 seconds of Michael Jackson's Billie Jean. An average of the peak distances (1025 ms) yields the overall pace for the song (about 59 pulses per minute). By walking through the entire song, one can easily detect sections whose pacing gives the music a half-time feel. This works quite well in the Pop/Rock genre. More complex music styles require more sophisticated techniques (Pikrakis, Antonopoulos, & Theodoridis, 2004).

Figure 4. The RMS values for the intro to Billie Jean. An average of the peak distances (1025 ms) yields an overall pace for the song.
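The peak-distance averaging can be sketched as follows; the relative-threshold peak picker is our own assumption, since the text does not specify how peaks are selected:

def pace_pulses_per_minute(rms_curve, step_s=0.01, rel_threshold=0.5):
    """Estimate pace by averaging the distances between RMS peaks.

    A peak here is a local maximum above half the curve's global
    maximum; this threshold is assumed, not taken from MusicStory.
    """
    cutoff = rel_threshold * max(rms_curve)
    peaks = [i for i in range(1, len(rms_curve) - 1)
             if rms_curve[i] >= cutoff
             and rms_curve[i] > rms_curve[i - 1]
             and rms_curve[i] >= rms_curve[i + 1]]
    gaps = [(b - a) * step_s for a, b in zip(peaks, peaks[1:])]
    if not gaps:
        return 0.0
    avg_gap_s = sum(gaps) / len(gaps)   # about 1.025 s for Billie Jean
    return 60.0 / avg_gap_s             # about 59 pulses per minute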
finding structure
The next step requires MusicStory to identify the
presence of vocals (hence lyrics) during a song.
The ability to detect vocal segments allows the
directing agent to tighten its focus on the photo
narrative aspects of the music video. Considering
the ease with which human listeners can detect
whether or not a singing voice is present in a musi-
cal recording, it may be surprising that automatic
detection of singing voice is an active area of
research within the music information retrieval
community. Identifying whether a vocal part is
present in a given segment of an audio signal is
challenging both because of the timbral diversity
the human voice is capable of and because many
commercial recordings include dozens of instrument
parts and are laden with effects processing, which
often obscures any salient characteristics of the
voice. Without prior structural knowledge of the
vocal or instrument signals, or knowledge of how
many or what types of signals are in the recording,
we cannot directly attribute components or sound
events (notes or percussion hits) in the song to their sources.