Rather than choosing music by artist or album, one sometimes wishes for music
that 'fits the occasion' or one's mood, such as when jogging, relaxing, or perhaps
having dinner for two. Thus, tags such as 'activating', 'calming' or 'romantic' would
be of help in music retrieval [147, 148]. Manual annotation by individual users
seems rather labour intensive, but some services, such as Allmusic, 7 provide such
tags, often based on several users' ratings. Regrettably, this information is
not always reliable, as the tags are often attached only to artists rather than to single
tracks. This motivates automated mood classification of music. In this
section, we will thus have a look at audio features suited for this particular task,
and benchmark results reachable with state-of-the-art approaches under real-world
conditions—without pre-selection of instances, e.g., by limiting analysis to those
with majority agreement of annotators.
Features for mood recognition can be extracted from the raw audio stream, but
also from metadata. Those derived from the audio can be complemented by mid-level ones
based on pre-classification. This means that, apart from the LLDs and functionals
as introduced in Sect. 11.6, knowledge from other classification tasks such as the
ones introduced for music processing in this chapter can be used as mid-level feature
information describing concepts such as rhythm or tonal structure. Metadata on the
other hand includes all types of textual information available on a music track such
as title, artist, genre, year of release or lyrics.
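To illustrate how these heterogeneous sources could be combined, the following Python
sketch concatenates functionals of LLDs, hypothetical mid-level classifier posteriors,
and a simple metadata encoding into a single feature vector; all names and encodings
here are illustrative assumptions, not part of any specific toolkit.

import numpy as np

def assemble_mood_features(lld_functionals, midlevel_posteriors, metadata):
    # lld_functionals:     functionals (mean, std, ...) of frame-wise LLDs
    # midlevel_posteriors: posteriors of pre-classifiers, e.g., rhythm or key
    #                      (hypothetical inputs)
    # metadata:            dict with textual fields such as 'genre' or 'year'
    # Very simple metadata encoding; real systems would use richer text
    # features, e.g., a bag-of-words representation of the lyrics.
    year = float(metadata.get('year', 2000)) / 2100.0
    is_rock = 1.0 if metadata.get('genre', '').lower() == 'rock' else 0.0
    meta_vec = np.array([year, is_rock])
    return np.concatenate([lld_functionals, midlevel_posteriors, meta_vec])

# Usage with dummy values:
x = assemble_mood_features(np.zeros(64), np.array([0.7, 0.2, 0.1]),
                           {'genre': 'Rock', 'year': 1999})
print(x.shape)  # (69,)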
In the literature so far, some commonalities are visible: In [149], a 30-element
feature vector containing timbre, pitch, and rhythm information is used. The work
in [150] employs timbre features given by spectral centroid, bandwidth, roll-off, and
spectral flux, as well as the minimum, maximum, and average amplitude plus RMS energy
of seven octave-interval sub-bands. For rhythm information, the lowest sub-band is used.
Edge detection with a Canny estimator leads to a rhythm curve. Peaks in this curve
are assumed to indicate the onsets of bass instruments, and their strength serves as
an indication of the degree of rhythm presence. Further, analysis by ACF serves as a
measure of rhythm steadiness, and the common divisor of the correlation peaks yields
the tempo.
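As a rough illustration of such sub-band rhythm analysis, the following Python sketch
derives a rhythm curve from the energy envelope of the lowest sub-band and estimates a
tempo candidate and a steadiness measure from its ACF; a simple differentiated envelope
stands in for the Canny-based edge detection of [150], and file name, cut-off frequency,
and frame parameters are placeholders.

import numpy as np
import librosa
from scipy.signal import butter, sosfilt, find_peaks

y, sr = librosa.load('track.wav', sr=22050, mono=True)  # placeholder file

# Lowest sub-band (roughly the bass register, here below 250 Hz).
sos = butter(4, 250.0, btype='low', fs=sr, output='sos')
low = sosfilt(sos, y)

# Short-time energy envelope of the sub-band as a stand-in rhythm curve;
# half-wave rectified differences emphasise onsets of bass instruments.
hop = 512
frames = librosa.util.frame(low, frame_length=2048, hop_length=hop)
env = np.sqrt((frames ** 2).mean(axis=0))
curve = np.maximum(np.diff(env, prepend=env[0]), 0.0)

# ACF of the rhythm curve: the dominant peak's lag gives a tempo candidate,
# its normalised height a crude measure of rhythm steadiness.
acf = librosa.autocorrelate(curve)
norm = acf[0] + 1e-9
acf[0] = 0.0
peaks, _ = find_peaks(acf)
if len(peaks):
    lag = peaks[np.argmax(acf[peaks])]
    tempo_bpm = 60.0 * sr / (hop * lag)
    steadiness = acf[lag] / norm
    print(round(tempo_bpm, 1), round(steadiness, 2))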
In [151], an extension is presented for the rhythm analysis by adding up all sub-band
onset curves. The authors of [152] also use rhythm and timbre features: Two tempo
candidates in BPM are based on peaks in an ACF-based beat histogram. From this
histogram, amplitude ratios and the sum of its ranges are added. Timbre is based on
13 MFCCs [153] and spectral centroid, flux, and roll-off. The mean and standard
deviation of the features over all frames are also included.
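A minimal sketch of such frame-wise timbre descriptors and their mean and standard
deviation functionals, assuming the librosa library and illustrative parameter choices,
could look as follows.

import numpy as np
import librosa

y, sr = librosa.load('track.wav', sr=22050, mono=True)     # placeholder file
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))    # magnitude spectrogram

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=512)
centroid = librosa.feature.spectral_centroid(S=S, sr=sr)
rolloff = librosa.feature.spectral_rolloff(S=S, sr=sr)

# Spectral flux: frame-to-frame change of the magnitude spectrum.
flux = np.sqrt((np.diff(S, axis=1) ** 2).sum(axis=0))
flux = np.concatenate([[0.0], flux])[np.newaxis, :]

lld = np.vstack([mfcc, centroid, rolloff, flux])            # (16, n_frames)
functionals = np.concatenate([lld.mean(axis=1), lld.std(axis=1)])
print(functionals.shape)                                    # (32,): 16 LLDs x {mean, std}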
In [154], a contribution to the MIREX 2008 8 audio mood classification task, MFCCs,
CHROMA, and spectral crest and flatness are employed; the latter two describe whether
the signal spectrum contains peaks, e.g., in the case of sinusoidal signals, or whether
it is flat, indicating noise.
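The following sketch, again assuming librosa and placeholder parameters, indicates how
spectral flatness and a crest measure separate peaky (tonal) from flat (noise-like)
spectra, alongside CHROMA and MFCC extraction.

import numpy as np
import librosa

y, sr = librosa.load('track.wav', sr=22050, mono=True)  # placeholder file
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))

flatness = librosa.feature.spectral_flatness(S=S)        # near 1 for noise, near 0 for tones
crest = S.max(axis=0) / (S.mean(axis=0) + 1e-9)          # large for sinusoidal peaks

chroma = librosa.feature.chroma_stft(S=S ** 2, sr=sr)    # 12-dimensional pitch-class profile
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

print(float(flatness.mean()), float(crest.mean()))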
The learning algorithms vary strongly for this task, just as the mood taxonomies
do (cf. Sect. 5.3.2). In fact, the diverse mood models certainly influence the selection
of the learning algorithm. As an example, in [150, 151] a four-class dimensional
model is handled by GMMs as the basis for a hierarchical classification system (HCS).
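A schematic illustration of such a two-stage GMM hierarchy over the four quadrants of
an arousal-valence plane is sketched below; it uses random data for demonstration only
and is not the exact system of [150, 151].

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 32))                 # 400 tracks, 32 functionals each
arousal = rng.integers(0, 2, 400)              # 0 = low, 1 = high
valence = rng.integers(0, 2, 400)              # 0 = negative, 1 = positive

def fit_binary_gmms(X, labels, n_components=4):
    # One GMM per class; classification by maximum average log-likelihood.
    return [GaussianMixture(n_components, covariance_type='diag',
                            random_state=0).fit(X[labels == c]) for c in (0, 1)]

stage1 = fit_binary_gmms(X, arousal)                           # arousal split
stage2 = {a: fit_binary_gmms(X[arousal == a], valence[arousal == a])
          for a in (0, 1)}                                     # valence per branch

def classify(x):
    x = x.reshape(1, -1)
    a = int(np.argmax([g.score(x) for g in stage1]))
    v = int(np.argmax([g.score(x) for g in stage2[a]]))
    return a, v                                                # quadrant label

print(classify(X[0]))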
7 Allmusic (http://www.allmusic.com)
8 MIREX 2008 (http://www.music-ir.org/mirex/2008)