Pitch Estimation and Voiced–Unvoiced Classification of Speech - Digital Speech: Coding for Low Bit Rate Communication Systems

In addition to correct pitch estimation, correct voiced–unvoiced estimation
is also crucial for good-quality speech synthesis. Traditional vocoders,
which have been in use for many years, classify the input speech signal as
either voiced or unvoiced. A voiced speech segment is characterized by its
relatively high energy content and, more importantly, by its periodicity.
The unvoiced part of speech, on the other hand, looks more like random noise
with no periodicity. However, some parts of speech are neither purely voiced
nor purely unvoiced, but a mixture of the two. These are usually called
transition regions, where the signal changes either from voiced to unvoiced
or from unvoiced to voiced. In low bit-rate speech coding, correct
classification of speech blocks (usually frames or subframes of 20 ms or
shorter) is critical for good-quality speech synthesis. If voiced speech is
classified as unvoiced, the synthesized output will sound rough and less
intelligible. If, on the other hand, unvoiced speech is classified as voiced,
the synthesized speech will sound annoyingly metallic or robotic. Older
vocoders used a hard voicing decision, so transitions were classified as
either fully voiced or fully unvoiced. Newer vocoders, such as
sinusoidal-based coders (IMBE, MELP, etc.), employ soft-decision voicing: a
third class is defined, in which voiced and unvoiced components exist
together. This mixed voicing decision is usually carried out in the frequency
domain, where voiced and unvoiced frequencies are appropriately selected to
represent the mixed signal. As a result, better-quality synthesized speech is
produced. In this chapter we review some of the advanced techniques used to
extract the correct pitch and subsequently to estimate the correct voicing in
each speech segment.
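
As a rough illustration of the energy and periodicity cues described above,
the following Python sketch labels a single frame as voiced, unvoiced, or
mixed. It is a time-domain simplification, not the frequency-domain soft
decision used in coders such as IMBE or MELP. The function names, the 8 kHz
sampling rate, the 60 to 400 Hz pitch search range, and all thresholds are
assumptions chosen for this example, not values taken from the text.

```python
import math
import random


def frame_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)


def periodicity(frame, min_lag, max_lag):
    """Return (peak normalized autocorrelation, lag of that peak)."""
    best_value, best_lag = 0.0, min_lag
    for lag in range(min_lag, max_lag + 1):
        # Compare the frame with a copy of itself shifted by `lag` samples.
        head, tail = frame[:-lag], frame[lag:]
        cross = sum(a * b for a, b in zip(head, tail))
        norm = math.sqrt(sum(a * a for a in head) * sum(b * b for b in tail))
        value = cross / norm if norm > 0.0 else 0.0
        if value > best_value:
            best_value, best_lag = value, lag
    return best_value, best_lag


def classify_frame(frame, sample_rate=8000, energy_floor=1e-4,
                   voiced_above=0.7, unvoiced_below=0.4):
    """Label a frame 'voiced', 'unvoiced' or 'mixed' (illustrative thresholds)."""
    # Very low energy: treat as unvoiced without looking for periodicity.
    if frame_energy(frame) < energy_floor:
        return "unvoiced"
    # Assumed pitch range of 60-400 Hz; keep at least half the frame overlapping.
    min_lag = sample_rate // 400
    max_lag = min(sample_rate // 60, len(frame) // 2)
    peak, _ = periodicity(frame, min_lag, max_lag)
    if peak >= voiced_above:
        return "voiced"
    if peak <= unvoiced_below:
        return "unvoiced"
    return "mixed"


if __name__ == "__main__":
    # Synthetic test signals: a strictly periodic frame and a noise frame.
    period = [math.sin(2 * math.pi * n / 80) for n in range(80)]
    rng = random.Random(0)
    noise = [rng.gauss(0.0, 0.1) for _ in range(240)]
    print(classify_frame(period * 3))
    print(classify_frame(noise))
```

Dropping the "mixed" branch and using a single threshold would correspond to
the hard decision of older vocoders. The intermediate class here only flags
frames with partial periodicity and does not say which frequencies are voiced.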

6.2 Pitch Estimation Methods

The excitation model used in source-filter vocoders relies heavily on the
correct determination of the pitch parameter. Incorrect pitch estimation may
significantly degrade the speech quality, and in particular its
intelligibility, by introducing artifacts into the synthetic speech.
Moreover, the estimation of other vocoder parameters, such as voicing and
spectral amplitudes, often assumes accurate pitch determination and is
severely affected by pitch errors. Therefore, the reliability of the pitch
determination algorithm (PDA) used has a dramatic effect on the quality of
the synthesized speech.

The pitch period is defined as the time interval between two consecutive
voiced (periodic) excitation cycles. Although this interval may vary from
cycle to cycle, it usually evolves slowly and can therefore be estimated.
Estimating the pitch period is generally easy for highly periodic sounds,
but some speech
