[Figure: Excitation source -> Vocal tract filter -> Speech]
FIGURE 18.1 A model for speech synthesis.
limitations, it fits in with the techniques described in this chapter; that is, what is stored or
transmitted is not the samples of the source output, but a method for synthesizing the output.
We will study this approach in Section 18.6.
18.3 Speech Compression
A very simplified model of speech synthesis is shown in Figure 18.1. As we described in
Chapter 7, speech is produced by forcing air first through an elastic opening, the vocal cords,
and then through the laryngeal, oral, nasal, and pharyngeal passages, and finally through the
mouth and the nasal cavity. Everything past the vocal cords is generally referred to as the
vocal tract. The first action generates the sound, which is then modulated into speech as it
passes through the vocal tract.
In Figure 18.1, the excitation source corresponds to the sound generation, and the vocal tract
filter models the vocal tract. As we mentioned in Chapter 7, there are several different sound
inputs that can be generated by different conformations of the vocal cords and the associated
cartilages.
Therefore, in order to generate a specific fragment of speech, we have to generate a sequence
of sound inputs or excitation signals and the corresponding sequence of appropriate vocal tract
approximations.
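The source-filter idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of any particular standard: the vocal tract filter is assumed to be a simple all-pole recursive filter, the excitation is assumed to be either a periodic pulse train or white noise, and the names `pulse_train`, `white_noise`, `vocal_tract_filter`, and `synthesize` are hypothetical.

```python
import random


def pulse_train(n, period):
    """Assumed periodic excitation: a unit pulse every `period` samples."""
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def white_noise(n, seed=0):
    """Assumed noise-like excitation: uniform samples in [-1, 1]."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


def vocal_tract_filter(excitation, coeffs, state=None):
    """Assumed all-pole filter: y[n] = x[n] + sum_k coeffs[k] * y[n-1-k].

    Returns the output samples and the final filter memory so that the
    next segment can continue from where this one stopped.
    """
    history = list(state) if state is not None else [0.0] * len(coeffs)
    out = []
    for x in excitation:
        y = x + sum(a * h for a, h in zip(coeffs, history))
        out.append(y)
        history = [y] + history[:-1]  # newest output first
    return out, history


def synthesize(segments):
    """Concatenate the outputs of a sequence of (excitation, coeffs) pairs.

    Assumes every segment uses the same number of filter coefficients.
    """
    state = None
    speech = []
    for excitation, coeffs in segments:
        samples, state = vocal_tract_filter(excitation, coeffs, state)
        speech.extend(samples)
    return speech


if __name__ == "__main__":
    segments = [
        (pulse_train(80, 20), [0.9, -0.5]),  # pulse-driven segment
        (white_noise(80), [0.3, -0.1]),      # noise-driven segment
    ]
    print(len(synthesize(segments)))
```

Each segment pairs one excitation signal with one set of filter coefficients, mirroring the "sequence of sound inputs" and "sequence of vocal tract approximations" in the text. Carrying the filter memory across segments is a design choice made here to avoid discontinuities at segment boundaries.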
At the transmitter, the speech is divided into segments. Each segment is analyzed to
determine an excitation signal and the parameters of the vocal tract filter. In some
schemes, a model for the excitation signal is transmitted to the receiver. The excitation signal
is then synthesized at the receiver and used to drive the vocal tract filter. In other schemes,
the excitation signal itself is obtained using an analysis-by-synthesis approach. This signal is
then used by the vocal tract filter to generate the speech signal.
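As a rough illustration of the analysis-by-synthesis idea, the following Python sketch tries each excitation from a small candidate set, synthesizes the result through the filter, and keeps the candidate whose output is closest to the original segment. Everything here is an assumption made for illustration: the all-pole filter form, the squared-error measure, the fixed candidate list, and the names `split_into_segments`, `all_pole_filter`, and `analysis_by_synthesis`. Practical coders use much larger candidate sets and perceptually motivated error measures.

```python
def split_into_segments(samples, size):
    """Divide the input into consecutive segments of at most `size` samples."""
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def all_pole_filter(excitation, coeffs):
    """Assumed vocal tract filter: y[n] = x[n] + sum_k coeffs[k] * y[n-1-k]."""
    history = [0.0] * len(coeffs)
    out = []
    for x in excitation:
        y = x + sum(a * h for a, h in zip(coeffs, history))
        out.append(y)
        history = [y] + history[:-1]  # newest output first
    return out


def squared_error(a, b):
    """Sum of squared differences between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def analysis_by_synthesis(target, candidates, coeffs):
    """Return (index, error) of the candidate excitation whose synthesized
    output best matches `target` under the squared-error measure."""
    best_index, best_error = None, float("inf")
    for index, excitation in enumerate(candidates):
        synthesized = all_pole_filter(excitation, coeffs)
        error = squared_error(target, synthesized)
        if error < best_error:
            best_index, best_error = index, error
    return best_index, best_error


if __name__ == "__main__":
    coeffs = [0.5]  # hypothetical filter parameters for one segment
    candidates = [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [1.0, 0.0, 1.0, 0.0],
    ]
    # Build a target segment from a known candidate, then search for it.
    target = all_pole_filter(candidates[2], coeffs)
    print(analysis_by_synthesis(target, candidates, coeffs))
```

In this sketch only the index of the winning candidate would need to be sent, provided the receiver holds the same candidate list. That shared list is an assumption of the example rather than something stated in the text.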
Over the years, many different analysis/synthesis speech compression schemes have been
developed, and substantial research into the development of new approaches and the
improvement of existing schemes continues. Given the large amount of information, we can only
sample some of the more popular approaches in this chapter. See [283, 284] for more detailed
coverage and pointers to the vast literature on the subject.
The approaches we will describe in this chapter include channel vocoders, which are of
special historical interest; the linear predictive coder, which is the U.S. Government standard at
the rate of 2.4 kbps; code-excited linear prediction (CELP) based schemes; sinusoidal coders,
which provide excellent performance at rates of 4.8 kbps and higher and are also a part of
several national and international standards; and mixed excitation linear prediction, which is
the new 2.4 kbps federal standard speech coder. In our description of these approaches, we
will use the various national and international standards as examples.
 