Digital Signal Processing Reference
Closed-loop mode selection has two major difficulties: high complexity
and difficulty in finding an objective measure which reflects the subjective
quality of synthesized speech [46]. The existing closed-loop mode selection
coders are based on CELP, and select the best configuration such that the
weighted MSE is minimized [47, 48]. Open-loop mode selection is based
on techniques such as voice activity detection, voicing decision, spectral
envelope variation, speech energy, and phonetic classification [10]. See [49]
for a detailed description of acoustic phonetics.
In the following discussion, a hybrid mode selection technique is used, with
an open-loop initial classification and a closed-loop secondary classification.
The open-loop initial classification decides to use either the noise excitation or
one of the other modes. The secondary classification synthesizes the harmonic
excitation and makes a closed-loop decision to use either the harmonic
excitation or ACELP. A special feature of this classifier is the application of
closed-loop mode selection to harmonic coding. The SWPM [26] preserves
the waveform similarity of the harmonically synthesized speech, making it
possible to apply closed-loop techniques in harmonic coding.
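The two-stage decision described above can be sketched as follows. This is only an illustrative outline, not the coder's actual implementation: the passage does not state the closed-loop criterion, so a normalized squared error compared against a placeholder threshold (`max_error`) is assumed here, and `is_noise_frame` and `synthesize_harmonic` are hypothetical callables standing in for the open-loop classifier and the harmonic synthesizer.

```python
def normalized_error(original, synthesized):
    """Squared error between two signals, divided by the original's energy."""
    if len(original) != len(synthesized):
        raise ValueError("signals must have the same length")
    energy = sum(x * x for x in original)
    if energy == 0.0:
        # Silent reference: any non-zero synthesis counts as an unbounded error.
        return 0.0 if all(y == 0 for y in synthesized) else float("inf")
    error = sum((x - y) ** 2 for x, y in zip(original, synthesized))
    return error / energy


def select_mode(frame, is_noise_frame, synthesize_harmonic, max_error=0.5):
    """Hybrid mode selection: open-loop first, then a closed-loop check.

    is_noise_frame and synthesize_harmonic are hypothetical callables;
    max_error is a placeholder threshold, not a value from the text.
    """
    # Stage 1 (open loop): unvoiced and silence frames use noise excitation.
    if is_noise_frame(frame):
        return "noise"
    # Stage 2 (closed loop): synthesize the harmonic excitation and keep it
    # only if it stays close enough to the original waveform.
    harmonic = synthesize_harmonic(frame)
    if normalized_error(frame, harmonic) <= max_error:
        return "harmonic"
    return "acelp"
```

The closed-loop comparison is meaningful only because the harmonically synthesized speech remains waveform-aligned with the original, which is the property the text attributes to the SWPM.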
9.6.1 Open-Loop Initial Classification
The initial classification extracts the fully unvoiced and silence segments of
speech, which are synthesized using white-noise excitation. It is based on
tracked energy, the low-band to high-band energy ratio, and the zero-crossing
rate of the speech signal. The three voicing metrics are logically combined to
enhance the reliability, since a single metric alone is not sufficient to make
a decision with high confidence. The metric combinations and thresholds
are determined empirically, by plotting the metrics with the corresponding
speech waveforms. A statistical approach is not suitable for deciding the
thresholds, because the two types of error have unequal costs: misclassifying
a voiced segment as unvoiced severely degrades the speech quality, whereas
misclassifying an unvoiced segment as voiced can be tolerated. A misclassified
unvoiced segment is still synthesized using ACELP, whereas a misclassified
voiced segment is synthesized using noise excitation.
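A minimal sketch of how three such metrics might be combined is given below. The text does not give the actual metric combinations or thresholds, so everything specific here is an assumption: the thresholds `te_max`, `ratio_max` and `zcr_min` are placeholders, the band split uses a crude two-tap sum/difference filter pair rather than a properly designed filter bank, and the logical AND is just one conservative way to bias the decision toward the voiced class.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def band_energy_ratio(frame):
    """Low-band to high-band energy ratio.

    Assumption: the bands are approximated by a two-tap average (low-pass)
    and a two-tap difference (high-pass), chosen only for brevity.
    """
    low = 0.0
    high = 0.0
    for a, b in zip(frame, frame[1:]):
        low += ((a + b) / 2) ** 2
        high += ((b - a) / 2) ** 2
    return low / high if high > 0 else float("inf")


def is_noise_excited(t_e, lb_hb_ratio, zcr,
                     te_max=0.5, ratio_max=1.0, zcr_min=0.3):
    """Return True only if all three metrics point to unvoiced or silence.

    All thresholds are hypothetical placeholders. Requiring agreement of
    every metric makes the costly error (voiced classified as unvoiced)
    less likely than the tolerable one.
    """
    return t_e < te_max and lb_hb_ratio < ratio_max and zcr > zcr_min
```

With this combination, a frame that looks unvoiced on only one or two metrics falls through to the other modes, which matches the asymmetry in misclassification cost described above.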
The tracked energy of speech, $t_e$, is estimated as follows:

$$t_e = \frac{0.00025\, e_h + e}{0.01\, e_h + e} \qquad (9.48)$$
where $e$ is the mean squared speech energy, given by

$$e = \frac{1}{N} \sum_{n=0}^{N-1} s^2(n) \qquad (9.49)$$
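Equations (9.48) and (9.49) translate directly into code, as in the sketch below. The function names are illustrative, and `e_h` is taken as a caller-supplied value because its definition is not part of this passage.

```python
def mean_squared_energy(frame):
    """Equation (9.49): average of s(n)^2 over the N samples of a frame."""
    n = len(frame)
    if n == 0:
        raise ValueError("frame must contain at least one sample")
    return sum(x * x for x in frame) / n


def tracked_energy(e, e_h):
    """Equation (9.48): (0.00025 * e_h + e) / (0.01 * e_h + e).

    e_h is supplied by the caller; how it is obtained is not covered here.
    """
    denominator = 0.01 * e_h + e
    if denominator == 0.0:
        raise ValueError("e and e_h must not both be zero")
    return (0.00025 * e_h + e) / denominator


# Example: energy of a short frame, tracked against an assumed e_h of 4.0.
frame = [1.0, -1.0, 2.0, -2.0]
e = mean_squared_energy(frame)
t_e = tracked_energy(e, e_h=4.0)
```

For non-negative energies the ratio is bounded: it approaches 1 when $e$ is much larger than $e_h$, and falls to 0.00025/0.01 = 0.025 as $e$ approaches zero.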