At the transmitter, a segment of speech is analyzed. The parameters obtained include a
decision as to whether the segment of speech is voiced or unvoiced, the pitch period if the
segment is declared voiced, and the parameters of the vocal tract filter. In this section, we will
take a somewhat detailed look at the various components that make up the linear predictive
coder. As an example, we will use the specifications for the 2.4-kbit/s U.S. Government standard
LPC-10.
The input speech is generally sampled at 8000 samples per second. In the LPC-10 standard,
the speech is broken into 180 sample segments, corresponding to 22.5 milliseconds of speech
per segment.
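The framing arithmetic above can be sketched in a few lines of Python. This is an illustrative sketch only; the constant and function names are our own, not identifiers from the LPC-10 standard, and real implementations must also decide how to handle a partial trailing frame.

```python
# Sketch of LPC-10 style frame segmentation (names are our own assumptions).
SAMPLE_RATE = 8000   # samples per second
FRAME_LEN = 180      # samples per segment: 180 / 8000 = 22.5 ms

def segment_speech(samples):
    """Split speech samples into 180-sample analysis frames,
    discarding any incomplete trailing frame."""
    n_frames = len(samples) // FRAME_LEN
    return [samples[i * FRAME_LEN:(i + 1) * FRAME_LEN]
            for i in range(n_frames)]

frames = segment_speech([0.0] * SAMPLE_RATE)  # one second of (silent) input
print(len(frames))                            # 44 complete frames
print(FRAME_LEN / SAMPLE_RATE * 1000)         # 22.5 (ms per frame)
```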
The Voiced/Unvoiced Decision
If we compare Figures 18.2 and 18.3 , we can see there are two major differences. Notice
that the samples of the voiced speech have larger amplitude; that is, there is more energy in
the voiced speech. Also, the unvoiced speech contains higher frequencies. As both speech
segments have average values close to zero, this means that the unvoiced speech waveform
crosses the x = 0 line more often than the voiced speech sample. Therefore, we can get a
fairly good idea about whether the speech is voiced or unvoiced based on the energy in the
segment relative to background noise and the number of zero crossings within a specified
window. In the LPC-10 algorithm, the speech segment is first low-pass filtered using a filter
with a bandwidth of 1 kHz. The energy at the output relative to the background noise is used to
obtain a tentative decision about whether the signal in the segment should be declared voiced
or unvoiced. The estimate of the background noise is basically the energy in the unvoiced
speech segments. This tentative decision is further refined by counting the number of zero
crossings and checking the magnitude of the coefficients of the vocal tract filter. We will talk
more about this latter point later in this section. Finally, it can be perceptually annoying to
have a single voiced frame sandwiched between unvoiced frames. The voicing decision of the
neighboring frames is considered in order to prevent this from happening.
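The two per-frame measurements described above, segment energy and zero-crossing count, can be sketched as follows. This is a minimal illustration of the idea, not the LPC-10 procedure itself: the function names are our own, and any thresholds used to combine these measurements into a voicing decision are omitted.

```python
# Sketch of the two voiced/unvoiced cues discussed above (our own names).
import math

def frame_energy(frame):
    """Sum of squared samples: voiced frames tend to have more energy."""
    return sum(x * x for x in frame)

def zero_crossings(frame):
    """Count sign changes: unvoiced frames tend to cross x = 0 more often."""
    return sum(1 for a, b in zip(frame, frame[1:])
               if (a >= 0) != (b >= 0))

# Synthetic stand-ins: a 100 Hz sinusoid (voiced-like) versus a weak,
# rapidly alternating signal (unvoiced-like), both 180 samples at 8 kHz.
voiced_like = [math.sin(2 * math.pi * 100 * n / 8000) for n in range(180)]
unvoiced_like = [0.1 * (-1) ** n for n in range(180)]

print(zero_crossings(voiced_like) < zero_crossings(unvoiced_like))  # True
print(frame_energy(voiced_like) > frame_energy(unvoiced_like))      # True
```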
Estimating the Pitch Period
Estimating the pitch period is one of the most computationally intensive steps of the analysis
process. Over the years a number of different algorithms for pitch extraction have been
developed. In Figure 18.2 , it would appear that obtaining a good estimate of the pitch should
be relatively easy. However, we should keep in mind that the segment shown in Figure 18.2
consists of 800 samples, which is considerably more than the samples available to the analysis
algorithm. Furthermore, the segment shown here is noise-free and consists entirely of a voiced
input. It can be a difficult undertaking for a machine to extract the pitch from a short noisy
segment that may contain both voiced and unvoiced samples.
Several algorithms make use of the fact that the autocorrelation of a periodic function,
R_xx(k), will have a maximum when k is equal to the pitch period. Coupled with the fact that the
estimation of the autocorrelation function generally leads to a smoothing out of the noise, this
makes the autocorrelation function a useful tool for obtaining the pitch period. Unfortunately,
there are also some problems with the use of the autocorrelation. Voiced speech is not exactly
periodic, which makes the maximum lower than we would expect from a periodic signal.
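The autocorrelation approach can be sketched as follows. This is an illustrative outline, not the pitch extractor specified in LPC-10: the lag search range is our own assumption (roughly 50-400 Hz at an 8 kHz sampling rate), and the synthetic test signal is perfectly periodic, which real voiced speech is not.

```python
# Sketch of autocorrelation-based pitch estimation (names and lag
# range are our own assumptions, not values from the LPC-10 standard).
import math

def autocorr(x, k):
    """R_xx(k): correlation of the frame with itself shifted by k samples."""
    return sum(x[n] * x[n + k] for n in range(len(x) - k))

def estimate_pitch_period(frame, min_lag=20, max_lag=160):
    """Return the lag k in [min_lag, max_lag] that maximizes R_xx(k)."""
    return max(range(min_lag, max_lag + 1), key=lambda k: autocorr(frame, k))

# Synthetic "voiced" signal: 100 Hz fundamental at 8 kHz, so the true
# pitch period is 80 samples.
frame = [math.sin(2 * math.pi * n / 80) for n in range(360)]
print(estimate_pitch_period(frame))  # 80
```

Note that R_xx(k) also peaks at multiples of the true period (here k = 160), which is one reason practical pitch extractors need more care than this brute-force maximum search.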