model. The harmonic mode consists of two components: the lower part of
the spectrum or the harmonic bandwidth, which is synthesized as a sum of
coherent sinusoids, and the upper part of the spectrum, which is synthesized
using sinusoids of random phases. The transitions are synthesized using pulse
excitation, similar to ACELP, and the unvoiced segments are synthesized
using white-noise excitation.
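The split-band harmonic synthesis described above can be sketched as a sum of pitch harmonics, coherent (here, zero-phase) below an assumed harmonic-bandwidth cutoff and random-phase above it. The parameterization (per-harmonic amplitudes, cutoff in Hz) is illustrative, not the coder's actual bitstream format.

```python
import numpy as np

def synthesize_frame(f0, amps, harmonic_bw, fs, n, rng=None):
    """Sketch of harmonic-mode synthesis: harmonics below `harmonic_bw`
    Hz are summed with coherent (zero) phases; harmonics above it are
    given random phases. All parameter names are illustrative."""
    rng = np.random.default_rng() if rng is None else rng
    t = np.arange(n) / fs
    out = np.zeros(n)
    for k, a in enumerate(amps, start=1):
        f = k * f0
        if f >= fs / 2:          # stop at the Nyquist frequency
            break
        # coherent phase in the lower band, random phase in the upper band
        phase = 0.0 if f <= harmonic_bw else rng.uniform(0.0, 2.0 * np.pi)
        out += a * np.cos(2.0 * np.pi * f * t + phase)
    return out
```

In a real coder the coherent phases would evolve continuously from frame to frame rather than being reset to zero; the split between the two bands is the point of the sketch.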
Speech classification is performed by a neural network, which takes into
account the speech parameters of the previous, current, and future frames,
and the previous mode decision. The classification parameters include the
speech energy, spectral tilt, zero-crossing rate, residual peakiness, residual
harmonic matching SNRs, and pitch deviation measures. At the onsets,
when switching from the waveform-coding mode, the harmonic excitation
is synchronized by shifting and maximizing the cross-correlation with the
waveform-coded excitation. At the offsets, the waveform-coding target is
shifted to maximize the cross-correlation with the harmonically-synthesized
speech, similar to the PWI coder.
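The onset synchronization amounts to a lag search: slide one excitation over a small range of shifts and keep the shift that maximizes its cross-correlation with the other. A minimal sketch, with an assumed search range and illustrative names:

```python
import numpy as np

def best_shift(reference, excitation, max_lag):
    """Return the circular shift of `excitation`, within +/- max_lag
    samples, that maximizes its cross-correlation with `reference`.
    A brute-force sketch; real coders restrict and weight the search."""
    best, best_corr = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        shifted = np.roll(excitation, lag)
        corr = float(np.dot(reference, shifted))
        if corr > best_corr:
            best, best_corr = lag, corr
    return best
```

The same criterion serves both directions quoted above: at onsets the harmonic excitation is shifted against the waveform-coded one, and at offsets the waveform-coding target is shifted against the harmonically-synthesized speech.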
9.3.3 A 4 kb/s Hybrid MELP/CELP Coder
The 4 kb/s hybrid MELP/CELP coder with alignment phase encoding and
zero phase equalization proposed by Stachurski et al. consists of three modes:
strongly-voiced, weakly-voiced, and unvoiced [18, 19]. The weakly-voiced
mode covers transitions and plosives; it is used when a segment is neither
clearly strongly-voiced nor clearly unvoiced. In the strongly-voiced mode,
a mixed excitation linear prediction (MELP) [20, 21] coder is used.
Weakly-voiced and unvoiced modes are synthesized using CELP.
In unvoiced frames, the LPC excitation is generated from a fixed stochastic
codebook. In weakly-voiced frames, the LPC excitation consists of the sum of a
long-term prediction filter output and a fixed innovation sequence containing
a limited number of pulses, similar to ACELP.
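The weakly-voiced excitation described above can be sketched as the sum of a long-term prediction contribution drawn from the past excitation and a sparse, ACELP-like innovation holding a few signed unit pulses. All parameter names and the periodic-extension handling here are assumptions for illustration:

```python
import numpy as np

def weakly_voiced_excitation(past_exc, pitch_lag, ltp_gain,
                             pulse_pos, pulse_signs, pulse_gain, n):
    """Sketch of a weakly-voiced frame's LPC excitation: a long-term
    prediction term taken one pitch lag back in the past excitation,
    plus a fixed innovation with a limited number of signed pulses."""
    # long-term prediction: copy the excitation from `pitch_lag` samples ago
    start = len(past_exc) - pitch_lag
    seg = past_exc[start:start + n]
    if len(seg) < n:
        # lag shorter than the frame: extend the last cycle periodically
        seg = np.resize(past_exc[start:], n)
    ltp = ltp_gain * seg
    # sparse innovation: a few signed unit pulses, as in ACELP codebooks
    innovation = np.zeros(n)
    innovation[list(pulse_pos)] = pulse_signs
    return ltp + pulse_gain * innovation
```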
The speech classification is based on the estimated voicing strength and
pitch. The signal continuity at the mode transitions is preserved by trans-
mitting an 'alignment phase' for MELP-encoded frames, and by using 'zero
phase equalization' for transitional frames. The alignment phase preserves
the time-synchrony between the original and synthesized speech. The align-
ment phase is estimated as the linear phase required in the MELP-encoded
excitation generation to maximize the cross-correlation between the MELP
excitation and the corresponding LPC residual. Zero-phase equalization
modifies the CELP target signal to reduce phase discontinuities by
removing the phase component that is not coded in MELP.
Zero-phase equalization is implemented in the LPC residual domain, with a
Finite Impulse Response (FIR) filter similar to [22]. The FIR filter coefficients
are derived from the smoothed pitch pulse waveforms of the LPC residual
signal. For unvoiced frames the filter coefficients are set to an impulse so
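One plausible realization of such a phase-removing FIR (the exact construction in [22] may differ) gives the filter a unit-magnitude frequency response that conjugates the phase of the extracted pitch-pulse waveform, so the filtered pulse becomes approximately zero-phase. The truncation to `n_taps` coefficients is illustrative:

```python
import numpy as np

def zero_phase_eq_filter(pulse, n_taps):
    """Sketch of a zero-phase-equalization FIR derived from a pitch-pulse
    waveform of the LPC residual: its spectrum cancels the pulse's phase
    while leaving magnitudes unchanged (assumed construction)."""
    spec = np.fft.fft(pulse)
    eps = 1e-12                      # guard against division by zero
    # unit-magnitude response whose phase is the conjugate of the pulse's
    allpass = np.conj(spec) / (np.abs(spec) + eps)
    h = np.real(np.fft.ifft(allpass))
    # center the circular impulse response and truncate to n_taps
    h = np.roll(h, n_taps // 2)[:n_taps]
    return h
```

For an already zero-phase pulse this construction reduces to a (delayed) unit impulse, which is consistent with the identity-like behavior the text describes for unvoiced frames.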