individual instrument or vocal parts. As a result,
researchers must search for signal features that
are common to all voices, whether they are male
or female and whether the singer is Etta James or
John Ashcroft. These features, and the classification method used, must also remain robust in the face of significant interference from the instruments in the recording. Both constraints motivate looking for characteristic differences between purely instrumental audio segments and instrument-plus-voice segments, rather than searching directly for the vocal waveform or spectrum.
Berenzweig and Ellis (2001) locate singing voice segments by incorporating the acoustic classifier of a speech recognizer to identify speech-like sounds in the music signal. The speech classifier is trained to identify English phonemes, so when the input signal is English speech, a clear-cut classification to a particular phoneme is achieved for the majority of time segments (or syllables). Berenzweig and Ellis make novel use of this classifier and show that, when music signals with and without singing voice are processed, the classification of those segments that contain a vocal part is somewhat similar to the classification of speech and characteristically different from the classification of the purely instrumental segments. They propose a number of statistical measures to identify transitions in the nature of the speech classifier's output, which are then treated as transitions between vocal and nonvocal segments.
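To make this concrete, the following minimal sketch illustrates one plausible statistic of this kind: the per-frame entropy of the phoneme posteriors, which is low when the classifier is confident (speech-like input) and high otherwise. The posterior matrix is assumed to come from an external speech recognizer, and the function names, smoothing length, and threshold are illustrative assumptions, not the specific measures Berenzweig and Ellis report.

    # Minimal sketch (not the authors' exact method): label frames as
    # vocal when the phoneme-posterior entropy is low, i.e., the speech
    # classifier is confident that it hears a specific phoneme.
    import numpy as np

    def posterior_entropy(posteriors):
        """Per-frame entropy of a (frames x phonemes) posterior matrix."""
        p = np.clip(posteriors, 1e-12, 1.0)
        return -np.sum(p * np.log(p), axis=1)

    def label_vocal_frames(posteriors, smooth=25, threshold=2.5):
        """Smooth the entropy curve and mark low-entropy frames as vocal.
        The window length and threshold are placeholder values."""
        h = posterior_entropy(posteriors)
        kernel = np.ones(smooth) / smooth         # moving-average smoothing
        h_smooth = np.convolve(h, kernel, mode="same")
        return h_smooth < threshold               # True where voice is likely

Transitions between vocal and nonvocal regions then correspond to edges in this Boolean sequence.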
Kim and Whitman (2002) use a band-pass filter (200–2000 Hz) to eliminate energy from signals that fall outside of the probable vocal range. They argue that any instruments remaining within this range will likely be broadband in nature (percussion instruments), and thus propose using harmonicity, the proportion of harmonic versus inharmonic energy, as an additional feature to distinguish between vocal and percussive sounds. However, the assertion that only voice and percussive instruments have significant energy in the 200–2000 Hz frequency band is unrealistic, because nearly all harmonic instruments can also have energy concentrated in this range; the success of this approach therefore seems to depend largely on the vocal part having significantly more energy in this range than the other harmonic instruments.
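A rough sketch of this pipeline appears below. The autocorrelation-based harmonicity proxy is a common stand-in assumed here for illustration; the original work computes harmonicity differently, and the filter order and pitch-lag bounds are likewise illustrative choices.

    # Sketch of the band-pass-plus-harmonicity idea, under the
    # assumptions stated above; not Kim and Whitman's exact measure.
    import numpy as np
    from scipy.signal import butter, filtfilt

    def bandpass_vocal(x, sr):
        """4th-order Butterworth band-pass over 200-2000 Hz."""
        b, a = butter(4, [200 / (sr / 2), 2000 / (sr / 2)], btype="band")
        return filtfilt(b, a, x)

    def harmonicity(frame, sr, fmin=80.0, fmax=1000.0):
        """Ratio of the strongest periodic autocorrelation peak to the
        frame energy: near 1 for harmonic sounds, near 0 for broadband
        percussion. Pitch-lag search bounds are placeholder values."""
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        if ac[0] <= 0:
            return 0.0
        lo = int(sr / fmax)                       # shortest plausible lag
        hi = min(int(sr / fmin), len(ac) - 1)     # longest plausible lag
        if hi <= lo:
            return 0.0
        return float(np.max(ac[lo:hi]) / ac[0])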
Maddage, Wan, Xu, and Wang (2004) propose
incorporating top-down musical knowledge in
identifying vocal segments. They use rhythm tracking to segment the audio signal into quarter-note-length frames. Each
frame is then processed using the twice-iterated
composite Fourier transform (TICFT). They
claim that voice signals tend to have a narrower
bandwidth than most instrument signals, and that
this characteristic is exaggerated in the lower-order coefficients of the TICFT. A simple linear threshold on the cumulative energy in the low-order TICFT coefficients is then used to determine whether a singing voice is active in a particular frame.
Three heuristic rules based on simple musical
knowledge and beat information are used to refine
the determination of vocal segments. While the heuristics do improve classification for music that adheres to the imposed constraints (the meter of the song is 4/4, vocal segments are at least two measures long, etc.), testing on a larger database of recordings is necessary to determine whether the approach is robust for arbitrary musical input. Also, while some instruments commonly found in popular music (percussion, guitar, piano) certainly do have a wider bandwidth than vocal signals, this claim is not well substantiated in the general case, and it is unclear whether the characteristic would remain reliable in recordings with many instruments present.
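The description above suggests a simple per-frame test, sketched below under the assumption that the TICFT amounts to applying a magnitude Fourier transform twice; the rhythm tracking that supplies the quarter-note frame boundaries is not shown, and the coefficient cutoff and threshold are placeholders rather than the values used by Maddage et al.

    # Sketch of the TICFT-based frame test, under the assumptions
    # stated above; cutoff and threshold values are placeholders.
    import numpy as np

    def ticft(frame):
        """Twice-iterated transform: magnitude FFT applied twice."""
        first = np.abs(np.fft.rfft(frame))
        return np.abs(np.fft.rfft(first))

    def frame_is_vocal(frame, n_low=32, threshold=0.6):
        """Vocal if the low-order TICFT coefficients carry a large share
        of the total energy, reflecting the narrower-bandwidth claim.
        (Normalizing by total energy is an assumption; the paper applies
        a linear threshold to the cumulative energy itself.)"""
        energy = ticft(frame) ** 2
        total = energy.sum()
        if total == 0:
            return False
        return bool(energy[:n_low].sum() / total >= threshold)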
Nwe and Wang (2004) observe that audio segments that contain both voice and instruments tend to have more high-frequency energy than purely instrumental segments. They report a systematic difference in log-frequency power coefficients (LFPCs) between instrumental and instrument-plus-vocal segments: the instrument-plus-voice segments have more power in the higher-frequency bands than the purely instrumental ones.
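For reference, a minimal sketch of an LFPC-style feature follows; the band count, frequency range, and band shapes (rectangular rather than any particular filter design) are assumptions for illustration and not necessarily the configuration Nwe and Wang use.

    # Sketch of log-frequency power coefficients (LFPCs): log power in
    # bands whose edges are spaced logarithmically in frequency. All
    # parameter values here are illustrative assumptions.
    import numpy as np

    def lfpc(frame, sr, n_bands=12, fmin=130.0, fmax=6000.0):
        """Log of summed spectral power in n_bands log-spaced bands."""
        power = np.abs(np.fft.rfft(frame)) ** 2
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
        edges = np.geomspace(fmin, fmax, n_bands + 1)   # log-spaced edges
        coeffs = np.empty(n_bands)
        for i in range(n_bands):
            band = (freqs >= edges[i]) & (freqs < edges[i + 1])
            coeffs[i] = np.log(power[band].sum() + 1e-12)
        return coeffs

Comparing the higher-order coefficients of such vectors across segments would then expose the high-frequency power difference the authors report.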