Digital Signal Processing Reference
normalized autocorrelation may be quite low whereas the speech is clearly
voiced. It is therefore necessary to make full use of the speech characteristics
described above to generate a good threshold function. In split-band voicing,
the threshold function is generated as follows [29]:
1. An initial linear threshold function is generated which starts at 0.4 and
goes up to 0.55. The value of the threshold is increased for harmonics
which correspond to the unvoiced harmonics in the previous frame. If the
previous frame is completely unvoiced, the threshold rises to the range 0.55-0.65
(increasing the chance of an unvoiced decision in the current frame).
2. The voicing-threshold function is biased using the following individual
parameters:
Low- to full-band energy ratio
Pre-emphasis energy ratio
Zero-crossing rate
Frame energy
Each parameter has a high and a low trigger level. When one of these is
crossed, the voicing threshold function is biased towards either voiced
or unvoiced.
3. The voicing-threshold function is biased using the pitch value. A high
number of harmonics present in the speech implies that the harmonic
bands are narrow and contain a small number of frequency bins. As
a result, the voicing likelihood tends to increase, as the matching is
performed on fewer points. The voicing threshold function needs to be
biased to compensate for this effect.
4. Finally, very specific cases detected in individual speech characteristics
are used to bias the threshold. For example, very high periodic similarity
is used to increase the voiced likelihood and very high zero-crossing rate
(in clean conditions) is used to increase the unvoiced likelihood.
This voicing determination method gives robust voicing decisions, even
under significant background noise conditions.
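The first three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of [29]: the names FrameFeatures, initial_threshold, parameter_bias, pitch_bias and voicing_threshold are hypothetical, as are all trigger levels and step sizes other than the 0.4, 0.55 and 0.65 quoted in the text. Only two of the four bias parameters are modelled, and the special cases of step 4 are left out.

```python
from dataclasses import dataclass


@dataclass
class FrameFeatures:
    """Hypothetical per-frame measurements (names and ranges are assumptions)."""
    low_band_energy_ratio: float  # low-band energy / full-band energy, 0..1
    zero_crossing_rate: float     # zero crossings per sample, 0..1


def initial_threshold(num_harmonics, prev_voiced=None, unvoiced_step=0.05):
    """Step 1: linear ramp from 0.4 to 0.55 across the harmonics.

    prev_voiced holds the previous frame's per-harmonic decisions. If all
    of them were unvoiced, the ramp becomes 0.55-0.65. Otherwise only the
    previously unvoiced harmonics are raised, by an assumed unvoiced_step.
    """
    if num_harmonics < 1:
        raise ValueError("need at least one harmonic")
    start, end = 0.4, 0.55
    if prev_voiced and not any(prev_voiced):
        start, end = 0.55, 0.65
    span = max(num_harmonics - 1, 1)
    ramp = [start + (end - start) * k / span for k in range(num_harmonics)]
    if prev_voiced and any(prev_voiced):
        for k in range(min(num_harmonics, len(prev_voiced))):
            if not prev_voiced[k]:
                ramp[k] += unvoiced_step
    return ramp


def parameter_bias(feat, step=0.05):
    """Step 2: high/low triggers shift the threshold (all levels assumed).

    A negative result favours a voiced decision, a positive one unvoiced.
    """
    bias = 0.0
    if feat.low_band_energy_ratio > 0.8:    # energy mostly at low frequencies
        bias -= step
    elif feat.low_band_energy_ratio < 0.3:
        bias += step
    if feat.zero_crossing_rate > 0.4:       # noise-like signal
        bias += step
    elif feat.zero_crossing_rate < 0.1:
        bias -= step
    return bias


def pitch_bias(num_harmonics, reference=20, slope=0.002):
    """Step 3: raise the threshold when many narrow harmonic bands exist.

    The linear form, reference count and slope are assumptions.
    """
    return slope * max(num_harmonics - reference, 0)


def voicing_threshold(num_harmonics, feat, prev_voiced=None):
    """Combine steps 1 to 3 and clamp each threshold to [0, 1]."""
    base = initial_threshold(num_harmonics, prev_voiced)
    shift = parameter_bias(feat) + pitch_bias(num_harmonics)
    return [min(max(t + shift, 0.0), 1.0) for t in base]


def classify(similarity, threshold):
    """A harmonic band is voiced when its similarity exceeds its threshold."""
    return [s > t for s, t in zip(similarity, threshold)]
```

A call such as classify(similarity, voicing_threshold(len(similarity), feat, prev_voiced)) would then give one voiced/unvoiced flag per harmonic band, where similarity is assumed to hold a per-band voicing likelihood between 0 and 1.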
6.4 Summary
Developments in the field of fast DSP technology have allowed the use of more
and more sophisticated algorithms required for accurate pitch estimation and
voiced-unvoiced classification. With the new multi-domain (frequency and
time) pitch estimation, it is possible to get good performance even under noisy
conditions. However, even the latest and most complex pitch estimation
algorithms are not perfect. In some speech segments, the pitch is not well-defined