Digital Signal Processing Reference
where $N$, the length of the analysis frames, is 160 and $e_h$ is an autoregressive
energy term given by

$$e_h = 0.9\,e_h + 0.1\,e \quad \text{if } 8e > e_h \tag{9.50}$$
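The recursion in (9.50) can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and taking $e$ to be the mean squared value of the current frame is an assumption made here, since its definition is not given in this passage.

```python
def frame_energy(frame):
    """Mean squared value of one analysis frame (assumed definition of e)."""
    return sum(x * x for x in frame) / len(frame)


def update_energy_tracker(e_h, e, alpha=0.9, gate=8.0):
    """One step of (9.50): update e_h only when gate * e > e_h."""
    if gate * e > e_h:
        return alpha * e_h + (1.0 - alpha) * e
    return e_h  # energy too low: leave the tracker unchanged


def track_energy(frames, e_h_init):
    """Return the value of e_h after each successive frame."""
    e_h = e_h_init
    history = []
    for frame in frames:
        e_h = update_energy_tracker(e_h, frame_energy(frame))
        history.append(e_h)
    return history
```

With the default arguments, a frame whose energy is below one eighth of the current $e_h$ leaves the tracker untouched, so pauses and low-level noise do not pull it down.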
The condition $8e > e_h$ ensures that $e_h$ is updated only when the speech energy
is sufficiently high; $e_h$ should be initialized to approximately the mean
squared energy of voiced speech. Figure 9.19a illustrates the tracked energy
over a segment of speech. The low-band to high-band energy ratio, $\gamma_\omega$, is
estimated as follows:

$$\gamma_\omega = \frac{\int_{0}^{\omega_s/4} S^2(\omega)\,d\omega}{\int_{\omega_s/4}^{\omega_s/2} S^2(\omega)\,d\omega} \tag{9.51}$$

where $\omega_s$ is the sampling frequency and $S(\omega)$ is the speech spectrum. The
speech spectrum is estimated using a 512-point FFT, after windowing 240
speech samples with a Kaiser window of $\beta = 6.0$. Figure 9.19b illustrates the
low-band to high-band energy ratio over a segment of speech, where the
speech signal is shifted down for clarity.
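A minimal NumPy sketch of this estimate is given below. It assumes the integrals over the two bands can be approximated by sums of squared FFT magnitudes, with the boundary at one quarter of the sampling frequency falling on bin 128 of the 512-point FFT. The function name and the exact bin assignment at the band edge are choices made for illustration, not taken from the text.

```python
import numpy as np


def band_energy_ratio(samples, n_fft=512, beta=6.0):
    """Low-band to high-band energy ratio of one windowed speech segment."""
    samples = np.asarray(samples, dtype=float)
    window = np.kaiser(len(samples), beta)
    # rfft zero-pads to n_fft and returns bins 0 .. n_fft/2 (0 to half the sampling rate).
    spectrum = np.fft.rfft(samples * window, n=n_fft)
    power = np.abs(spectrum) ** 2
    quarter = n_fft // 4  # bin at one quarter of the sampling frequency
    low = power[:quarter].sum()
    high = power[quarter:].sum()
    return low / high  # assumes the high band is not exactly silent
```

For a 240-sample segment dominated by low-frequency content the ratio is well above 1, and for one dominated by content above a quarter of the sampling frequency it falls below 1.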
The zero-crossing rate is defined as the number of times the signal changes
sign, divided by the number of samples used in the observation. Figure 9.20a
illustrates the zero-crossing rate over a segment of speech, where the speech
[Plots omitted: two panels, each showing the speech signal s(n) with the metric, against samples.]
(a) Tracked energy, $t_e$
(b) Low-band to high-band energy ratio, $\gamma_\omega$
Figure 9.19 Voicing metrics of the initial classification
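The zero-crossing rate defined above (sign changes divided by the number of samples observed) can be sketched as follows. The function name is hypothetical, and treating a sample equal to zero as non-negative is an assumption of this sketch, since the text does not say how zeros are handled.

```python
def zero_crossing_rate(samples):
    """Number of sign changes divided by the number of samples observed."""
    samples = list(samples)
    if len(samples) == 0:
        return 0.0
    crossings = 0
    for prev, curr in zip(samples, samples[1:]):
        # A crossing occurs when consecutive samples lie on opposite sides of zero.
        # Zero itself is counted as non-negative (an assumption of this sketch).
        if (prev >= 0) != (curr >= 0):
            crossings += 1
    return crossings / len(samples)
```

A signal that alternates sign on every sample gives a rate close to 1, while a signal that never changes sign gives 0.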