Applications in Intelligent Music Analysis - Intelligent Audio Analysis


[...] $\alpha = \dots$ being optimal in theory cannot be used. Evaluating [...] revealed [...] as the best solution.

Comb filter banks are instantiated for a broad range to also cover higher metrical layers. The features obtained from the filter outputs describe the distribution of resonances across several metrical layers and thereby the metrical structure. To keep the number of comb filters within reasonable limits, one can exploit the multiple metrical layers present in a musical piece: first, the tatum tempo is estimated. Then, potentially present higher metrical levels are assumed to have periods at integer multiples of the tatum period, that is, tempi at integer fractions of the tatum tempo. This holds for a broad range of genres.
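As a rough illustration of how a tatum estimate limits the number of comb filters, the following sketch derives candidate filter delays from a tatum tempo. It assumes the integer-multiple relation is applied to the pulse period (the comb-filter delay in frames), since higher metrical levels are slower than the tatum. The function name `candidate_delays`, the default frame rate of 100 FPS, and the `max_multiple` parameter are illustrative choices rather than part of the described system.

```python
def candidate_delays(tatum_bpm, frame_rate=100.0, max_multiple=8):
    """Return comb-filter delays (in frames) at integer multiples of the tatum period.

    Assumption: one comb filter per integer multiple of the tatum period,
    up to max_multiple (a hypothetical limit chosen for illustration).
    """
    if tatum_bpm <= 0:
        raise ValueError("tatum_bpm must be positive")
    # Frames per tatum pulse: 60 s/min * frames/s / pulses/min.
    tatum_period = 60.0 * frame_rate / tatum_bpm
    return [round(n * tatum_period) for n in range(1, max_multiple + 1)]


delays = candidate_delays(600.0, max_multiple=4)
```

With a tatum of 600 BPM at 100 FPS, the tatum period is 10 frames, so the four candidate delays are 10, 20, 30, and 40 frames instead of one filter per possible delay.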

For the processing, the audio signal is downsampled to $f_s = 11.025$ kHz and converted into a monophonic signal by stereo-channel addition. The input of length $L_i$ seconds is chunked by Hamming windowing into $N_{frames} = 100 \cdot L_i$ frames of $N_{s,block} = 256$ samples with a frame overlap of 0.57. This corresponds to a frame rate of approximately 100 FPS. 128 DFT coefficients are then computed per frame. $M_{mel}$ overlapping triangular filters, which are equidistant on the Mel-frequency scale as used in speech recognition for the computation of MFCC [82] (cf. Sect. 6.2.1.4), reduce these coefficients to envelope samples of $M_{mel}$ non-linear bands. The reduced number of frequency bands covers the human auditory frequency range. According to [5], the rhythmic structure is entirely preserved in this compact form of representation. The envelope samples $x_{m,k}$ per Mel-frequency band $m$ are logarithmised by:

$$x_{m,k,\log} = \log\left(x_{m,k} + \dots\right) \qquad (11.17)$$
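The framing arithmetic and the logarithmic compression described above can be sketched as follows. This is a minimal, dependency-free sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the hop size is assumed to be the frame length times (1 - overlap) rounded to whole samples, no padding is applied at the end of the signal, and the additive offset inside the logarithm is an assumed value (1.0, which keeps the result non-negative for non-negative input). The DFT and Mel filter bank stages are omitted.

```python
import math

SAMPLE_RATE = 11025   # Hz, after downsampling
FRAME_LEN = 256       # samples per frame
OVERLAP = 0.57        # fraction of a frame shared with its successor


def hop_size(frame_len=FRAME_LEN, overlap=OVERLAP):
    """Samples between frame starts (assumed: frame_len * (1 - overlap), rounded)."""
    return round(frame_len * (1.0 - overlap))


def hamming(n):
    """Standard Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def frame_signal(samples, frame_len=FRAME_LEN, overlap=OVERLAP):
    """Split a mono signal into Hamming-windowed frames (no padding at the end)."""
    hop = hop_size(frame_len, overlap)
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append([samples[start + i] * window[i] for i in range(frame_len)])
    return frames


def log_compress(envelope, offset=1.0):
    """Logarithmise envelope samples; the offset value is an assumption."""
    return [math.log(x + offset) for x in envelope]


hop = hop_size()
frame_rate = SAMPLE_RATE / hop
```

With these values the hop is 110 samples, which gives about 100.2 frames per second and is consistent with the stated frame rate of roughly 100 FPS. Because this sketch does not pad the signal, a one-second input yields slightly fewer than 100 frames.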

The envelopes $x_m$ of the Mel-frequency bands are then low-pass filtered for smoothing. This is realised by convolution with a half-wave raised cosine filter $h_{\cos}$. A length of 15 envelope samples, or 150 ms, has proven a good value. Overall, it preserves fast attacks but filters out noise and rapid modulation, similar to human sound sensation:

$$h_{\cos}(\dots) = \dots \cos\left(\pi \dots\right), \quad \dots \in [\dots] \qquad (11.18)$$
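One possible realisation of this smoothing step is sketched below. The kernel shape is an assumption: it uses the falling half of a raised cosine, 1 + cos(π·o/15) for o = 1, ..., 15, normalised to unit sum and applied as a causal convolution. The names `half_raised_cosine` and `smooth_envelope` are hypothetical, and the exact kernel used in the described system may differ.

```python
import math


def half_raised_cosine(length=15):
    """Falling half of a raised cosine, normalised to unit sum (assumed shape)."""
    h = [1.0 + math.cos(math.pi * o / length) for o in range(1, length + 1)]
    total = sum(h)
    return [v / total for v in h]


def smooth_envelope(envelope, kernel):
    """Causal convolution: each output uses the current and past samples only."""
    out = []
    for k in range(len(envelope)):
        acc = 0.0
        for o, weight in enumerate(kernel):
            if k - o >= 0:
                acc += weight * envelope[k - o]
        out.append(acc)
    return out


kernel = half_raised_cosine()
smoothed = smooth_envelope([1.0] * 40, kernel)
```

Because the largest weights sit on the most recent samples, a kernel of this shape reacts quickly to a sudden rise in the envelope while averaging out fluctuations faster than its 15-sample span.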

Per low-pass filtered Mel-frequency band envelope $m$, a weighted differential $d_m$ is applied:

$$d_m(k) = \left(x_{m,k} - \bar{x}_{m,k,l}\right) \cdot \bar{x}_{m,k,r} \qquad (11.19)$$

For a sample $x_{m,k}$ at position $k$, a moving average is calculated over one window of 10 samples to the left of sample $x_{m,k}$ (left mean $\bar{x}_{m,k,l}$) as well as over a second window of 20 samples to the right of sample $x_{m,k}$ (right mean $\bar{x}_{m,k,r}$) [6]. The motivation is that humans perceive note onsets as more intense after a longer phase of lower sound level [85]. Further, note duration and energy are crucial factors in the perceived note accentuation [75].
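The weighted differential can be sketched as below. The sketch assumes the form (sample minus left mean) times right mean, that both windows exclude the centre sample, and that windows are truncated at the signal borders, with an empty window contributing a mean of 0. These boundary choices and the function name `weighted_differential` are illustrative assumptions.

```python
def weighted_differential(envelope, left=10, right=20):
    """Compute (x[k] - left_mean) * right_mean for every position k.

    left/right are the window lengths in envelope samples. Windows are
    truncated at the borders; an empty window yields a mean of 0.0
    (an assumption made for this sketch).
    """
    result = []
    for k, x in enumerate(envelope):
        left_win = envelope[max(0, k - left):k]
        right_win = envelope[k + 1:k + 1 + right]
        left_mean = sum(left_win) / len(left_win) if left_win else 0.0
        right_mean = sum(right_win) / len(right_win) if right_win else 0.0
        result.append((x - left_mean) * right_mean)
    return result


# A step from silence to a sustained level: onset at index 15.
step = [0.0] * 15 + [1.0] * 25
d = weighted_differential(step)
onset = d.index(max(d))
```

For this step input the output peaks at index 15, the first sample of the sustained level: the left mean is still 0 there (a preceding quiet phase), while the right mean is high (a long, energetic note).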
