Pitch Estimation and Voiced–Unvoiced Classification of Speech - Digital Speech: Coding for Low Bit Rate Communication Systems


normalized autocorrelation may be quite low whereas the speech is clearly

voiced. It is therefore necessary to make full use of the speech characteristics

described above to generate a good threshold function. In split-band voicing,

the threshold function is generated as follows [29]:

1. An initial linear threshold function is generated which starts at 0.4 and

goes up to 0.55. The value of the threshold is increased for harmonics

which correspond to the unvoiced harmonics in the previous frame. If the

previous frame is completely unvoiced, the threshold increases to 0.55–0.65

(increasing the chance of an unvoiced decision in the current frame).

2. The voicing-threshold function is biased using the following individual

parameters:

• Low- to full-band energy ratio

• Pre-emphasis energy ratio

• Zero-crossing rate

• Frame energy

Each of these parameters has a high and a low threshold; when one of them

is crossed, the voicing-threshold function is biased towards either voiced

or unvoiced accordingly.

3. The voicing-threshold function is biased using the pitch value. A high

number of harmonics present in the speech implies that the harmonic

bands are narrow and contain a small number of frequency bins. As

a result, the voicing likelihood tends to increase, as the matching is

performed on fewer points. The voicing threshold function needs to be

biased to compensate for this effect.

4. Finally, very specific cases detected in individual speech characteristics

are used to bias the threshold. For example, very high periodic similarity

is used to increase the voiced likelihood and very high zero-crossing rate

(in clean conditions) is used to increase the unvoiced likelihood.
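
The four steps above can be sketched roughly as follows. Only the end points 0.4, 0.55 and 0.65 come from the text; the function names, the per-harmonic raising rule, the additive bias model and every step size are assumptions made for illustration, and the previous frame's voicing flags are assumed to be already mapped onto the current frame's harmonics.

```python
import numpy as np

# Threshold end points quoted in the text.
BASE_LO, BASE_HI = 0.40, 0.55      # initial linear threshold
RAISED_LO, RAISED_HI = 0.55, 0.65  # previous frame completely unvoiced


def initial_threshold(prev_voiced):
    """Step 1: linear threshold, raised where the previous frame was unvoiced.

    prev_voiced: one boolean per harmonic of the current frame (assumed to be
    already mapped from the previous frame's harmonics).
    """
    prev_voiced = np.asarray(prev_voiced, dtype=bool)
    n = prev_voiced.size
    base = np.linspace(BASE_LO, BASE_HI, n)
    raised = np.linspace(RAISED_LO, RAISED_HI, n)
    # Assumption: a previously unvoiced harmonic takes the raised value,
    # so an all-unvoiced previous frame yields the 0.55-0.65 line.
    return np.where(prev_voiced, base, raised)


def biased_threshold(prev_voiced, bias_votes=0, step=0.05,
                     many_harmonics=40, pitch_step=0.05):
    """Steps 2-4 collapsed into additive offsets (all amounts hypothetical).

    bias_votes: net count of triggered parameters or special cases; positive
    values push towards unvoiced (higher threshold), negative towards voiced.
    """
    thr = initial_threshold(prev_voiced) + step * bias_votes
    # Step 3: many narrow harmonic bands inflate the voicing likelihood,
    # so the threshold is raised to compensate.
    if thr.size >= many_harmonics:
        thr = thr + pitch_step
    return np.clip(thr, 0.0, 1.0)


def classify(voicing_likelihood, threshold):
    """Declare a harmonic voiced when its likelihood exceeds the threshold."""
    return np.asarray(voicing_likelihood, dtype=float) > threshold
```

With this model, a frame that follows a fully voiced frame starts from the 0.4 to 0.55 line, whereas one that follows a fully unvoiced frame starts from the 0.55 to 0.65 line, so the same likelihood values are less likely to be declared voiced.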

This voicing determination method gives very robust detection, even under

significant background noise.
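
The four biasing parameters listed in step 2 are not defined in this passage. The sketch below uses common textbook definitions as an assumption: the 1 kHz band edge, the 0.95 pre-emphasis coefficient, the 8 kHz sampling rate and the function name `frame_features` are all hypothetical choices, not values taken from the source.

```python
import numpy as np


def frame_features(x, fs=8000, low_cutoff_hz=1000.0, preemph=0.95):
    """Compute illustrative versions of the four biasing parameters.

    All definitions and default values here are assumptions.
    """
    x = np.asarray(x, dtype=float)
    eps = 1e-12  # guards against division by zero on silent frames

    # Frame energy: mean squared sample value.
    energy = float(np.mean(x ** 2))

    # Zero-crossing rate: fraction of adjacent sample pairs that change sign.
    signs = np.signbit(x)
    zcr = np.count_nonzero(signs[1:] != signs[:-1]) / max(len(x) - 1, 1)

    # Low- to full-band energy ratio from the power spectrum.
    spec = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    low_ratio = float(spec[freqs <= low_cutoff_hz].sum() / (spec.sum() + eps))

    # Pre-emphasis energy ratio: energy after y[n] = x[n] - a*x[n-1],
    # relative to the energy of the same samples before filtering.
    y = x[1:] - preemph * x[:-1]
    pre_ratio = float(np.sum(y ** 2) / (np.sum(x[1:] ** 2) + eps))

    return {"energy": energy, "zcr": zcr,
            "low_band_ratio": low_ratio, "preemph_ratio": pre_ratio}
```

Under these definitions, a low-frequency periodic frame gives a low zero-crossing rate, a low-band ratio near 1 and a small pre-emphasis ratio, whereas a noise-like or high-frequency frame gives the opposite pattern; these are the tendencies that high and low trigger thresholds would exploit.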

6.4 Summary

Advances in fast DSP technology have allowed the use of increasingly

sophisticated algorithms for accurate pitch estimation and

voiced–unvoiced classification. With the new multi-domain (frequency and

time) pitch estimation, it is possible to get good performance even under noisy

conditions. However, even the latest and most complex pitch estimation

algorithms are not perfect. In some speech segments, the pitch is not well-defined
