Digital Signal Processing Reference
In-Depth Information
being optimal in theory cannot be used. Evaluating $\alpha$ in $[0.20, 0.99]$ revealed $\alpha = 0.7$ as the best solution.
Comb filter banks are instantiated over a broad range so that they also cover higher metrical
layers. The features obtained from the filter outputs describe the distribution of
resonances across several metrical layers, and thereby the metrical structure. To keep the
number of comb filters within reasonable limits, one can exploit the multiple metrical
layers present in a musical piece: first, the tatum tempo is estimated. Then, potentially
present higher metrical levels are assumed to have tempi at integer multiples
of this tempo. This holds for a broad range of genres.
For the processing, the audio signal is downsampled to $f_s = 11.025$ kHz and
converted into a monophonic signal by stereo-channel addition. The input of length
$L_i$ seconds is chunked by Hamming windowing into $N_{\text{frames}} = 100 \cdot L_i$ frames
of $N_{s,\text{block}} = 256$ samples with a frame overlap of 0.57. This corresponds to a frame
rate of 100 FPS. 128 DFT coefficients are then computed per frame. $M_{\text{mel}}$ overlapping
triangular filters, equidistant on the Mel-frequency scale as used in
speech recognition for the computation of MFCC [82] (cf. Sect. 6.2.1.4), reduce these
coefficients to envelope samples of $M_{\text{mel}}$ non-linear bands. The reduced number of
frequency bands covers the human auditory frequency range. According to [5], the
rhythmic structure is entirely preserved in this compact form of representation. The
envelope samples $x_{m,k}$ per Mel-frequency band $m$ are logarithmised by:
$$x_{m,k,\log} = 10 \cdot \log\left(x_{m,k} + 1\right) \tag{11.17}$$
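The framing arithmetic and the logarithmic compression above can be checked with a short sketch. It uses only the values stated in the text (11.025 kHz, 256 samples, overlap 0.57). The function names, the fractional hop size (a practical implementation would probably round it), and the choice of a base-10 logarithm in Eq. (11.17) are assumptions made for illustration and are not taken from the source.

```python
import math

FS = 11025        # sampling rate in Hz (from the text)
FRAME_LEN = 256   # samples per frame (from the text)
OVERLAP = 0.57    # fractional frame overlap (from the text)


def hop_size(frame_len=FRAME_LEN, overlap=OVERLAP):
    """Advance between frame starts in samples (kept fractional here, an assumption)."""
    return frame_len * (1.0 - overlap)


def frame_rate(fs=FS, frame_len=FRAME_LEN, overlap=OVERLAP):
    """Frames per second implied by the sampling rate and the hop size."""
    return fs / hop_size(frame_len, overlap)


def num_full_frames(length_s, fs=FS, frame_len=FRAME_LEN, overlap=OVERLAP):
    """Number of complete frames that fit into a signal of length_s seconds."""
    n_samples = length_s * fs
    if n_samples < frame_len:
        return 0
    return int(math.floor((n_samples - frame_len) / hop_size(frame_len, overlap))) + 1


def hamming(n_len=FRAME_LEN):
    """Symmetric Hamming window coefficients."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_len - 1))
            for n in range(n_len)]


def log_compress(x):
    """Eq. (11.17) for one envelope sample; base 10 is an assumption."""
    return 10.0 * math.log10(x + 1.0)


if __name__ == "__main__":
    print("hop size [samples]:", hop_size())
    print("frame rate [FPS]:  ", frame_rate())
    print("frames in 10 s:    ", num_full_frames(10.0))
```

With these values the hop is about 110 samples, so the frame rate comes out close to, but not exactly, 100 FPS, which matches the approximate wording of the text.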
The envelopes $x_m$ of the Mel-frequency bands are then low-pass filtered for smoothing.
This is realised by convolution with a half-wave raised cosine filter $h_{\cos}$. A
length of 15 envelope samples, or 150 ms, has proven a good value.
Overall, it preserves fast attacks but filters out noise and rapid modulation, similar to
human sound sensation:
$$h_{\cos}(k) = \cos\left(\frac{\pi k}{15}\right) + 1, \quad k \in [1, 15] \tag{11.18}$$
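As a minimal sketch, the smoothing step can be written as a causal convolution of one band envelope with the kernel of Eq. (11.18). The function names, the causal alignment, and the normalisation of the kernel to unit sum are assumptions for illustration; the text specifies only the kernel shape and its length of 15 samples.

```python
import math


def half_wave_raised_cosine(length=15):
    """Kernel h_cos(k) = cos(pi * k / length) + 1 for k = 1..length (Eq. 11.18)."""
    return [math.cos(math.pi * k / length) + 1.0 for k in range(1, length + 1)]


def smooth(envelope, kernel=None, normalise=True):
    """Causally convolve one band envelope with the kernel.

    Normalising the kernel to unit sum (so that a constant envelope keeps
    its level) is an assumption, not something stated in the text.
    """
    if kernel is None:
        kernel = half_wave_raised_cosine()
    if normalise:
        total = sum(kernel)
        kernel = [h / total for h in kernel]
    out = []
    for n in range(len(envelope)):
        acc = 0.0
        for j, h in enumerate(kernel):
            if n - j >= 0:  # samples before the start are treated as zero
                acc += h * envelope[n - j]
        out.append(acc)
    return out


if __name__ == "__main__":
    impulse = [1.0] + [0.0] * 19
    print(smooth(impulse))  # the normalised kernel followed by zeros
```

The kernel starts near 2 and decays to 0 at k = 15, so recent samples dominate the weighted sum. This is one way to read the statement that fast attacks are preserved while rapid modulation is attenuated.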
A weighted differential $d_m$ is applied to each low-pass filtered Mel-frequency band envelope $m$:
$$d_m(k) = \left(x_{m,k} - x_{m,k,l}\right) \cdot x_{m,k,r} \tag{11.19}$$
For a sample $x_{m,k}$ at position $k$, a moving average is calculated over one window of
10 samples to the left of sample $x_{m,k}$ (left mean $x_{m,k,l}$) as well as a second window of
20 samples to the right of sample $x_{m,k}$ (right mean $x_{m,k,r}$) [6]. The motivation is that
humans perceive note onsets as more intense after a longer phase of lower sound
level [85]. Further, note duration and energy are crucial factors in the perceived note
accentuation [75].
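The weighted differential can be sketched as follows, assuming Eq. (11.19) has the form d_m(k) = (x_{m,k} - x_{m,k,l}) · x_{m,k,r}, i.e. the rise above the left mean, weighted by the right mean. This reading, the function name, and the handling of the signal edges (shortened windows, and a zero difference when no left context exists) are assumptions for illustration.

```python
def weighted_differential(x, left=10, right=20):
    """Weighted differential of one smoothed band envelope.

    Assumed form: d(k) = (x[k] - left_mean) * right_mean, where left_mean
    averages up to `left` samples before k and right_mean averages up to
    `right` samples after k. Edge handling is an assumption.
    """
    d = []
    for k in range(len(x)):
        left_win = x[max(0, k - left):k]
        right_win = x[k + 1:k + 1 + right]
        # With no left context, use x[k] itself so that the difference is zero.
        left_mean = sum(left_win) / len(left_win) if left_win else x[k]
        # With no right context, the weight is zero.
        right_mean = sum(right_win) / len(right_win) if right_win else 0.0
        d.append((x[k] - left_mean) * right_mean)
    return d


if __name__ == "__main__":
    # A step from silence to a sustained level: the onset is at index 10.
    envelope = [0.0] * 10 + [1.0] * 25
    d = weighted_differential(envelope)
    print(max(range(len(d)), key=lambda i: d[i]))
```

Under this reading, an onset scores highest when it follows a quiet stretch (small left mean) and is followed by sustained energy (large right mean), which is in line with the perceptual motivation given in the text.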
 