model. The harmonic mode consists of two components: the lower part of
the spectrum or the harmonic bandwidth, which is synthesized as a sum of
coherent sinusoids, and the upper part of the spectrum, which is synthesized
using sinusoids of random phases. The transitions are synthesized using pulse
excitation, similar to ACELP, and the unvoiced segments are synthesized
using white-noise excitation.
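The split-band harmonic synthesis described above can be sketched as a sum of pitch harmonics, coherent (here, zero-phase) below an assumed harmonic-bandwidth cutoff and random-phase above it. The parameterization (per-harmonic amplitudes, cutoff in Hz) is illustrative, not the coder's actual bitstream format.

```python
import numpy as np

def synthesize_frame(f0, amps, harmonic_bw, fs, n, rng=None):
    """Sketch of harmonic-mode synthesis: harmonics below `harmonic_bw`
    Hz are summed with coherent (zero) phases; harmonics above it are
    given random phases. All parameter names are illustrative."""
    rng = np.random.default_rng() if rng is None else rng
    t = np.arange(n) / fs
    out = np.zeros(n)
    for k, a in enumerate(amps, start=1):
        f = k * f0
        if f >= fs / 2:          # stop at the Nyquist frequency
            break
        # coherent phase in the lower band, random phase in the upper band
        phase = 0.0 if f <= harmonic_bw else rng.uniform(0.0, 2.0 * np.pi)
        out += a * np.cos(2.0 * np.pi * f * t + phase)
    return out
```

In a real coder the coherent phases would evolve continuously from frame to frame rather than being reset to zero; the split between the two bands is the point of the sketch.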
Speech classification is performed by a neural network, which takes into
account the speech parameters of the previous, current, and future frames,
and the previous mode decision. The classification parameters include the
speech energy, spectral tilt, zero-crossing rate, residual peakiness, residual
harmonic matching SNRs, and pitch deviation measures. At the onsets,
when switching from the waveform-coding mode, the harmonic excitation
is synchronized by shifting and maximizing the cross-correlation with the
waveform-coded excitation. At the offsets, the waveform-coding target is
shifted to maximize the cross-correlation with the harmonically-synthesized
speech, similar to the PWI coder.
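The onset synchronization amounts to a lag search: slide one excitation over a small range of shifts and keep the shift that maximizes its cross-correlation with the other. A minimal sketch, with an assumed search range and illustrative names:

```python
import numpy as np

def best_shift(reference, excitation, max_lag):
    """Return the circular shift of `excitation`, within +/- max_lag
    samples, that maximizes its cross-correlation with `reference`.
    A brute-force sketch; real coders restrict and weight the search."""
    best, best_corr = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        shifted = np.roll(excitation, lag)
        corr = float(np.dot(reference, shifted))
        if corr > best_corr:
            best, best_corr = lag, corr
    return best
```

The same criterion serves both directions quoted above: at onsets the harmonic excitation is shifted against the waveform-coded one, and at offsets the waveform-coding target is shifted against the harmonically-synthesized speech.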
9.3.3 A 4 kb/s Hybrid MELP/CELP Coder
The 4 kb/s hybrid MELP/CELP coder with alignment phase encoding and
zero phase equalization proposed by Stachurski et al. consists of three modes:
strongly-voiced, weakly-voiced, and unvoiced [18, 19]. The weakly-voiced
mode covers transitions and plosives; it is used when a segment is neither
clearly strongly-voiced nor clearly unvoiced. In the strongly-voiced mode,
a mixed excitation linear prediction (MELP) [20, 21] coder is used.
Weakly-voiced and unvoiced modes are synthesized using CELP.
In unvoiced frames, the LPC excitation is generated from a fixed stochastic
codebook. In weakly-voiced frames, the LPC excitation consists of the sum of a
long-term prediction filter output and a fixed innovation sequence containing
a limited number of pulses, similar to ACELP.
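The weakly-voiced excitation described above can be sketched as the sum of a long-term prediction contribution drawn from the past excitation and a sparse, ACELP-like innovation holding a few signed unit pulses. All parameter names and the periodic-extension handling here are assumptions for illustration:

```python
import numpy as np

def weakly_voiced_excitation(past_exc, pitch_lag, ltp_gain,
                             pulse_pos, pulse_signs, pulse_gain, n):
    """Sketch of a weakly-voiced frame's LPC excitation: a long-term
    prediction term taken one pitch lag back in the past excitation,
    plus a fixed innovation with a limited number of signed pulses."""
    # long-term prediction: copy the excitation from `pitch_lag` samples ago
    start = len(past_exc) - pitch_lag
    seg = past_exc[start:start + n]
    if len(seg) < n:
        # lag shorter than the frame: extend the last cycle periodically
        seg = np.resize(past_exc[start:], n)
    ltp = ltp_gain * seg
    # sparse innovation: a few signed unit pulses, as in ACELP codebooks
    innovation = np.zeros(n)
    innovation[list(pulse_pos)] = pulse_signs
    return ltp + pulse_gain * innovation
```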
The speech classification is based on the estimated voicing strength and
pitch. The signal continuity at the mode transitions is preserved by trans-
mitting an 'alignment phase' for MELP-encoded frames, and by using 'zero
phase equalization' for transitional frames. The alignment phase preserves
the time-synchrony between the original and synthesized speech. The align-
ment phase is estimated as the linear phase required in the MELP-encoded
excitation generation to maximize the cross-correlation between the MELP
excitation and the corresponding LPC residual. Zero-phase equalization
modifies the CELP target signal to reduce phase discontinuities by
removing the phase component that is not coded in MELP.
Zero-phase equalization is implemented in the LPC residual domain, with a
Finite Impulse Response (FIR) filter similar to [22]. The FIR filter coefficients
are derived from the smoothed pitch pulse waveforms of the LPC residual
signal. For unvoiced frames the filter coefficients are set to an impulse so
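One plausible realization of such a phase-removing FIR (the exact construction in [22] may differ) gives the filter a unit-magnitude frequency response that conjugates the phase of the extracted pitch-pulse waveform, so the filtered pulse becomes approximately zero-phase. The truncation to `n_taps` coefficients is illustrative:

```python
import numpy as np

def zero_phase_eq_filter(pulse, n_taps):
    """Sketch of a zero-phase-equalization FIR derived from a pitch-pulse
    waveform of the LPC residual: its spectrum cancels the pulse's phase
    while leaving magnitudes unchanged (assumed construction)."""
    spec = np.fft.fft(pulse)
    eps = 1e-12                      # guard against division by zero
    # unit-magnitude response whose phase is the conjugate of the pulse's
    allpass = np.conj(spec) / (np.abs(spec) + eps)
    h = np.real(np.fft.ifft(allpass))
    # center the circular impulse response and truncate to n_taps
    h = np.roll(h, n_taps // 2)[:n_taps]
    return h
```

For an already zero-phase pulse this construction reduces to a (delayed) unit impulse, which is consistent with the identity-like behavior the text describes for unvoiced frames.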