Digital Signal Processing Reference
where $N$, the length of the analysis frames, is 160 and $e_h$ is an autoregressive energy term given by

$$e_h = 0.9\,e_h + 0.1\,e \quad \text{if } 8e > e_h \tag{9.50}$$
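As a minimal sketch of the gated update in (9.50), the following Python function steps through a sequence of frame energies. The function name `track_energy` and its arguments are illustrative assumptions, and the per-frame energy `e` is taken as a precomputed input rather than derived here.

```python
def track_energy(frame_energies, e_h_init):
    """Apply the gated autoregressive update of (9.50) frame by frame.

    frame_energies: iterable of per-frame energies e (assumed precomputed).
    e_h_init:       starting value of e_h, e.g. roughly the mean squared
                    energy of voiced speech.
    Returns the value of e_h after each frame.
    """
    e_h = e_h_init
    tracked = []
    for e in frame_energies:
        # Update only when the frame is energetic enough: 8e > e_h.
        if 8.0 * e > e_h:
            e_h = 0.9 * e_h + 0.1 * e
        # Otherwise e_h is held at its previous value.
        tracked.append(e_h)
    return tracked
```

For example, starting from `e_h_init = 100.0`, a frame with energy 200 moves the tracked value to about 110, while a following frame with energy 1 fails the `8e > e_h` test and leaves it unchanged.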
The condition $8e > e_h$ ensures that $e_h$ is updated only when the speech energy is sufficiently high, and $e_h$ should be initialized to approximately the mean squared energy of voiced speech. Figure 9.19a illustrates the tracked energy over a segment of speech. The low-band to high-band energy ratio, $\gamma_\omega$, is estimated as follows:
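$$\gamma_\omega = \frac{\displaystyle\int_{0}^{1/4} S^2\!\left(\frac{\omega}{\omega_s}\right) d\!\left(\frac{\omega}{\omega_s}\right)}{\displaystyle\int_{1/4}^{1/2} S^2\!\left(\frac{\omega}{\omega_s}\right) d\!\left(\frac{\omega}{\omega_s}\right)} \tag{9.51}$$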
where $\omega_s$ is the sampling frequency and $S(\omega)$ is the speech spectrum. The speech spectrum is estimated using a 512-point FFT, after windowing 240 speech samples with a Kaiser window of $\beta = 6.0$. Figure 9.19b illustrates the low-band to high-band energy ratio over a segment of speech, where the speech signal is shifted down for clarity.
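One way to approximate (9.51) numerically is sketched below with NumPy. It replaces the two integrals with sums of squared FFT magnitudes over the bins below and above one quarter of the sampling frequency. The function name `band_energy_ratio`, the bin split at `n_fft // 4`, and the small constant guarding the division are assumptions made for this illustration, not details given in the text.

```python
import numpy as np

def band_energy_ratio(frame, n_fft=512, beta=6.0):
    """Low-band to high-band energy ratio of one frame, after (9.51).

    frame: 1-D sequence of speech samples (240 in the setup described).
    The integrals over [0, fs/4] and [fs/4, fs/2] are approximated by
    sums of squared FFT magnitudes over the matching bins.
    """
    frame = np.asarray(frame, dtype=float)
    # Kaiser window with the stated shape parameter beta = 6.0.
    windowed = frame * np.kaiser(len(frame), beta)
    # rfft zero-pads to n_fft and returns bins 0 .. n_fft/2 (0 .. fs/2).
    power = np.abs(np.fft.rfft(windowed, n=n_fft)) ** 2
    split = n_fft // 4  # bin index corresponding to fs/4
    low = power[:split].sum()
    high = power[split:].sum()
    # Tiny constant avoids division by zero (an added assumption).
    return low / (high + 1e-12)
```

A frame dominated by low-frequency content gives a ratio well above 1, and a frame dominated by content above one quarter of the sampling frequency gives a ratio below 1.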
The zero-crossing rate is defined as the number of times the signal changes
sign, divided by the number of samples used in the observation. Figure 9.20a
illustrates the zero-crossing rate over a segment of speech, where the speech
[Figure 9.19: Voicing metrics of the initial classification. (a) Tracked energy, $t_e$. (b) Low-band to high-band energy ratio, $\gamma_\omega$. Axes: s(n) versus samples.]
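The zero-crossing rate, defined earlier as the number of sign changes divided by the number of samples observed, can be sketched in a few lines of Python. The function name `zero_crossing_rate` and the choice to treat a sample of exactly zero as non-negative are assumptions made here, since the text does not specify how zeros are handled.

```python
def zero_crossing_rate(samples):
    """Number of sign changes divided by the number of samples observed."""
    if len(samples) == 0:
        return 0.0
    crossings = 0
    for prev, cur in zip(samples, samples[1:]):
        # Count a crossing when adjacent samples have opposite signs.
        # Zero is treated as non-negative (a convention chosen here).
        if (prev >= 0) != (cur >= 0):
            crossings += 1
    return crossings / len(samples)
```

For instance, `zero_crossing_rate([1, -1, 1, -1])` returns 0.75: three sign changes over four samples.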