Multimode Speech Coding - Digital Speech: Coding for Low Bit Rate Communication Systems

Closed-loop mode selection has two major difficulties: high complexity

and difficulty in finding an objective measure which reflects the subjective

quality of synthesized speech [46]. The existing closed-loop mode selection

coders are based on CELP, and select the best configuration such that the

weighted MSE is minimized [47, 48]. Open-loop mode selection is based

on techniques such as voice activity detection, voicing decision, spectral

envelope variation, speech energy, and phonetic classification [10]. See [49]

for a detailed description of acoustic phonetics.

In the following discussion, a hybrid mode selection technique is used, with

an open-loop initial classification and a closed-loop secondary classification.

The open-loop initial classification decides to use either the noise excitation or

one of the other modes. The secondary classification synthesizes the harmonic

excitation and makes a closed-loop decision to use either the harmonic

excitation or ACELP. A special feature of this classifier is the application of

closed-loop mode selection to harmonic coding. The SWPM [26] preserves

the waveform similarity of the harmonically-synthesized speech, making it

possible to apply closed-loop techniques in harmonic coding.
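
The two-stage decision described above can be sketched as follows. This is a minimal illustration, not the coder's actual logic: the names `Mode`, `waveform_similarity`, and `select_mode` are hypothetical, and both the similarity measure (normalized cross-correlation) and the threshold value are assumptions, since the text does not specify the closed-loop criterion.

```python
import math
from enum import Enum


class Mode(Enum):
    NOISE = "noise"
    HARMONIC = "harmonic"
    ACELP = "acelp"


def waveform_similarity(original, synthesized):
    """Normalized cross-correlation in [-1, 1].

    Assumed closed-loop measure; the text does not name the one it uses.
    """
    num = sum(a * b for a, b in zip(original, synthesized))
    den = math.sqrt(sum(a * a for a in original) * sum(b * b for b in synthesized))
    return num / den if den > 0.0 else 0.0


def select_mode(noise_like, original, harmonic_synth, threshold=0.5):
    """Hybrid mode selection: open-loop first, closed-loop second.

    `threshold` is a hypothetical value chosen only for illustration.
    """
    # Stage 1 (open loop): noise excitation versus all other modes.
    if noise_like:
        return Mode.NOISE
    # Stage 2 (closed loop): keep the harmonic excitation only if the
    # harmonically synthesized frame resembles the original waveform.
    if waveform_similarity(original, harmonic_synth) >= threshold:
        return Mode.HARMONIC
    return Mode.ACELP
```

A waveform-domain comparison of this kind is only meaningful because the harmonically synthesized speech stays aligned with the original, which is the property the text attributes to the SWPM.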

9.6.1 Open-Loop Initial Classification

The initial classification extracts the fully unvoiced and silence segments of

speech, which are synthesized using white-noise excitation. It is based on

tracked energy, the low-band to high-band energy ratio, and the zero-crossing

rate of the speech signal. The three voicing metrics are logically combined to

enhance the reliability, since a single metric alone is not sufficient to make

a decision with high confidence. The metric combinations and thresholds

are determined empirically, by plotting the metrics with the corresponding

speech waveforms. A statistical approach is not suitable for deciding the

thresholds, because the design of the classification algorithm should consider

that a misclassification of a voiced segment as unvoiced will severely degrade

the speech quality, but a misclassification of an unvoiced segment as voiced

can be tolerated. A misclassified unvoiced segment will be synthesized using

ACELP, whereas a misclassified voiced segment will be synthesized using

noise excitation.
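
One way to combine the three metrics is sketched below. The thresholds `T_ENERGY`, `T_BAND_RATIO`, and `T_ZCR` are hypothetical placeholders (the text says the real values are set empirically and does not list them), and requiring all three metrics to agree is an assumption about the logical combination. The functions `zero_crossing_rate` and `is_noise_excited` are illustrative names, not part of any published coder.

```python
# Hypothetical thresholds, for illustration only.
T_ENERGY = 0.1      # tracked energy below this suggests silence or weak speech
T_BAND_RATIO = 1.0  # low-band/high-band energy ratio below this: high band dominates
T_ZCR = 0.3         # zero-crossing rate above this suggests noise-like content


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def is_noise_excited(t_e, band_ratio, zcr):
    """Return True only when all three metrics point to unvoiced or silence.

    Assumed combination: a logical AND, so any single dissenting metric
    sends the frame to the other modes instead of noise excitation.
    """
    low_energy = t_e < T_ENERGY
    high_band_dominant = band_ratio < T_BAND_RATIO
    many_crossings = zcr > T_ZCR
    return low_energy and high_band_dominant and many_crossings
```

A conservative AND reflects the asymmetry noted above: a frame wrongly kept out of the noise mode is still coded with ACELP, which is tolerable, whereas a voiced frame wrongly sent to noise excitation degrades quality severely.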

The tracked energy of speech, $t_e$, is estimated as follows:

$$t_e = \frac{0.00025\,e_h + e}{0.01\,e_h + e} \qquad (9.48)$$

where $e$ is the mean squared speech energy, given by

$$e = \frac{1}{N}\sum_{n=0}^{N-1} s^2(n) \qquad (9.49)$$
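
As a rough numerical check, the two quantities can be computed as in the sketch below. It reads Eq. (9.48) as the ratio (0.00025 e_h + e) / (0.01 e_h + e) and Eq. (9.49) as the sum of squared samples divided by the frame length N. The function names are hypothetical, and `e_h` is simply passed in as an argument because this excerpt does not say how it is obtained.

```python
def mean_squared_energy(frame):
    """Eq. (9.49): e = (1/N) * sum of s(n)^2 for n = 0 .. N-1."""
    if not frame:
        raise ValueError("frame must contain at least one sample")
    return sum(s * s for s in frame) / len(frame)


def tracked_energy(e, e_h):
    """Eq. (9.48): t_e = (0.00025*e_h + e) / (0.01*e_h + e).

    `e_h` is taken as given; its definition is outside this excerpt.
    """
    denom = 0.01 * e_h + e
    if denom <= 0.0:
        raise ValueError("0.01*e_h + e must be positive")
    return (0.00025 * e_h + e) / denom


# Example: a short frame and an assumed value for e_h.
frame = [1.0, -1.0, 2.0, -2.0]
e = mean_squared_energy(frame)
t_e = tracked_energy(e, e_h=100.0)
```

Read this way, $t_e$ is bounded: it approaches 1 when $e$ is much larger than $e_h$, and it approaches 0.00025/0.01 = 0.025 when $e$ is much smaller than $e_h$.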
