Digital Signal Processing Reference
In addition to correct pitch estimation, correct voiced-unvoiced estimation
is also crucial for good quality speech synthesis. Traditional vocoders,
which have been in use for many years, classify the input speech signal
as either voiced or unvoiced. A voiced speech segment is characterized by
its relatively high energy content and, more importantly, by its periodicity.
Unvoiced speech, on the other hand, looks more like random
noise with no periodicity. However, there are some parts of speech that are
neither voiced nor unvoiced, but a mixture of the two. These are usually
called the transition regions, where there is a change either from voiced
to unvoiced or unvoiced to voiced. In low bit-rate speech coding, correct
classification of speech blocks (usually frames or subframes 20 ms long, or
shorter) is critical for good quality speech synthesis. If voiced speech
is classified as unvoiced, the synthesized output will sound rough and less
intelligible. If, on the other hand, unvoiced speech is classified as voiced,
the synthesized speech will sound annoyingly metallic or robotic. In older
vocoders, hard-decision voicing was used, and the transition regions
were classified as either fully voiced or fully unvoiced. In newer vocoders,
such as sinusoidal-based coders (IMBE, MELP, etc.), soft-decision voicing
is employed: a third class, in which voiced and unvoiced components exist
together, has been defined. This mixed voicing decision is
usually made in the frequency domain, where voiced and unvoiced
frequencies are appropriately selected to represent the mixed signal. As a
result, better quality synthesized speech is produced. In this chapter we
review some of the advanced techniques which are used in extracting the
correct pitch and subsequently estimating the correct voicing in each speech
segment.
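As a rough illustration of the hard voicing decision described above, the
following Python sketch labels a single frame using the two cues mentioned
in the text: energy and periodicity. It is a minimal sketch under stated
assumptions, not the method of any particular vocoder. The function name
classify_frame, the threshold values, the 60-400 Hz pitch search range, the
8 kHz sampling rate and the use of a normalized cross-correlation peak as
the periodicity measure are all assumptions made for illustration.

```python
import numpy as np


def classify_frame(frame, fs=8000, energy_thresh=1e-4,
                   periodicity_thresh=0.5, f0_min=60.0, f0_max=400.0):
    """Hard voiced/unvoiced decision for one frame.

    All thresholds and the pitch search range are illustrative assumptions.
    Returns (label, energy, periodicity_peak).
    """
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()                      # remove any DC offset
    energy = float(np.mean(x ** 2))       # mean-square energy of the frame

    # Very low energy: treat as unvoiced (or silence) without further tests.
    if energy < energy_thresh:
        return "unvoiced", energy, 0.0

    # Candidate pitch lags, in samples, for the assumed F0 range.
    lag_min = int(fs / f0_max)
    lag_max = min(int(fs / f0_min), len(x) - 1)

    # Periodicity cue: largest normalized cross-correlation between the
    # frame and a copy of itself shifted by a candidate pitch lag.
    peak = 0.0
    for lag in range(lag_min, lag_max + 1):
        a, b = x[:-lag], x[lag:]
        denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
        if denom > 0.0:
            peak = max(peak, float(np.dot(a, b) / denom))

    label = "voiced" if peak >= periodicity_thresh else "unvoiced"
    return label, energy, peak
```

A hard decision of this kind forces every transition frame into one of the
two classes, which is the limitation that soft-decision voicing is meant to
address. The thresholds shown would need tuning on real speech.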
6.2 Pitch Estimation Methods
The excitation model used in source-filter vocoders relies heavily on the
correct determination of the pitch parameter. Incorrect pitch estimation may
significantly degrade the speech quality, and in particular its intelligibility,
by introducing artifacts into the synthetic speech. Moreover, other parameter
estimations such as voicing and spectral amplitudes in vocoders often assume
accurate pitch determination, and are severely affected by pitch errors.
Therefore, the reliability of the pitch determination algorithm (PDA) used
has a dramatic effect on the quality of the synthesized speech.
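To make concrete what a PDA computes, the sketch below estimates the pitch
period of one frame by picking the lag that maximizes the frame's
autocorrelation. This is a simple textbook approach chosen for illustration,
not necessarily one of the methods reviewed in this chapter. The function
name estimate_pitch_autocorr, the 8 kHz sampling rate and the 60-400 Hz
search range are assumptions.

```python
import numpy as np


def estimate_pitch_autocorr(frame, fs=8000, f0_min=60.0, f0_max=400.0):
    """Estimate the pitch of one frame from its autocorrelation peak.

    Returns (lag_in_samples, f0_in_hz), or None if the frame is silent or
    too short for the assumed search range.
    """
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()                      # remove any DC offset

    # Autocorrelation for non-negative lags 0 .. len(x) - 1.
    r = np.correlate(x, x, mode="full")[len(x) - 1:]

    # Candidate pitch lags, in samples, for the assumed F0 range.
    lag_min = int(fs / f0_max)
    lag_max = min(int(fs / f0_min), len(x) - 1)
    if r[0] <= 0.0 or lag_max <= lag_min:
        return None

    # The pitch period estimate is the lag with the largest autocorrelation.
    lag = lag_min + int(np.argmax(r[lag_min:lag_max + 1]))
    return lag, fs / lag
```

On clean, strongly periodic input this picks the true period. On real
speech, simple estimators of this kind are prone to errors such as pitch
doubling and halving, which is one reason the reliability of the PDA matters
so much.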
Pitch period is defined as the time interval between two consecutive voiced
(periodic) excitation cycles. Although this interval may vary from cycle to
cycle, it usually evolves slowly, and therefore it can be estimated. Estimating
the pitch period is generally easy for highly periodic sounds, but some speech