Digital Signal Processing Reference
6.2.1.9 Fundamental Frequency and Voicing Probability
The fundamental frequency F0, or equivalently the fundamental period length T0, plays a key role
among speech parameters and in prosodic information. The human ear is considerably
more sensitive to changes in the fundamental frequency than to changes in other
parameters of the speech signal [15]. This makes it evident that high precision is
required for its determination, and in fact, the correct determination of F0 has a significant
influence on intelligent speech analysis as shown, e.g., in the author's work
on emotion recognition in speech [32].
It may seem an easy task at first, as one only has to determine the period length of
a quasi-periodic signal [33]. However, a number of factors make it more challenging
than that, and in fact one of the most difficult tasks of speech signal analysis [2]:
As stated, speech production is in principle a non-stationary process. The position
of the vocal tract during articulation may change very quickly, leading to significant
changes in the structure of the speech time signal. This may already occur from
one fundamental period to the next [16]. Further, the multiplicity of
articulator positions of the human vocal tract, in combination with the multiplicity of
human voices, results in a huge variety of possible time structures of the speech signal.
Then, narrow-band lower formants can easily be confused with the fundamental
frequency. In particular, the first formant can easily be confused with F0 for female
voices, where it is typically found around 200-1400 Hz [2]. The excitation signal of
the human voice itself is not always regular. This also holds in normal conditions, i.e.,
in the absence of pathological effects. The voice can further switch into the 'strohbass'
register, with a very low-frequency and irregular excitation as low as 25 Hz. Across
speakers, the fundamental frequency can further vary over about four octaves
(50-800 Hz). Finally, the transmission channel may lead to distortions or band limitations,
such as in the case of (narrow-band) telephone speech (300-3400 Hz).
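The figures quoted above can be checked with a little arithmetic. The following is a minimal Python sketch, not taken from the source: the function names `octave_span` and `harmonics_in_band` are hypothetical, and the band edges default to the narrow-band telephone range mentioned in the text. It computes the octave span of a frequency range and lists which harmonics of a given F0 fall inside a band-limited channel, assuming an idealised channel with sharp band edges.

```python
import math


def octave_span(f_low_hz, f_high_hz):
    """Number of octaves between two frequencies (one octave = factor of 2)."""
    return math.log2(f_high_hz / f_low_hz)


def harmonics_in_band(f0_hz, band_low_hz=300.0, band_high_hz=3400.0):
    """Harmonic numbers k for which k * f0_hz lies inside the band.

    Assumes an idealised channel with sharp band edges (illustration only).
    """
    k_max = int(band_high_hz // f0_hz)
    return [k for k in range(1, k_max + 1) if k * f0_hz >= band_low_hz]


# Octave span of the 50-800 Hz range quoted in the text.
span = octave_span(50.0, 800.0)

# Harmonics of an assumed 100 Hz voice that pass a 300-3400 Hz channel.
passed = harmonics_in_band(100.0)
print(span, passed[:3])
```

For the assumed 100 Hz voice, the first harmonic (the fundamental itself) is not in the list, which illustrates why band limitation can complicate the determination of F0.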
This has led to a considerable number of Pitch Detection Algorithms (PDAs),
none of which works to full satisfaction in arbitrary conditions [34]. Some of these aim
at determining the fundamental period T0, which is related to F0 by:
$$F_0 = \frac{1}{T_0} \tag{6.56}$$
If T0 is to be determined, it is considered as a momentary value, i.e., the time from the
beginning of one period to the beginning of the subsequent one. If the speech signal
were strictly periodic, both definitions would lead to the same result.
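As a small illustration of the momentary period and its reciprocal relation to the fundamental frequency, the sketch below derives a period sequence from period onsets and converts each period to a frequency value. It is an assumption-laden example, not the author's method: the onset positions, the 16 kHz sampling rate, and the function names `momentary_periods` and `f0_from_period` are all hypothetical.

```python
def momentary_periods(onsets_samples):
    """Momentary period lengths: distance from each period onset to the next."""
    return [b - a for a, b in zip(onsets_samples, onsets_samples[1:])]


def f0_from_period(period_samples, sample_rate_hz):
    """Convert a period length in samples to a frequency in Hz (F0 = 1 / T0)."""
    if period_samples <= 0:
        raise ValueError("period length must be positive")
    return sample_rate_hz / period_samples


# Hypothetical onsets (in samples) of a quasi-periodic signal at 16 kHz.
sample_rate_hz = 16000
onsets = [0, 160, 320, 496, 656]

periods = momentary_periods(onsets)
f0_track = [f0_from_period(p, sample_rate_hz) for p in periods]
print(periods, f0_track)
```

The third period in this made-up example is slightly longer than its neighbours, so the momentary frequency dips for that one period; for a strictly periodic signal all values would coincide.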
Each PDA can be subdivided into three steps:
the pre-processing, which aims at a data reduction to focus on the problem at hand,
the actual extraction,
and the post-processing, which usually aims to smooth the overall pitch track and
correct minor errors, e.g., by Viterbi smoothing (cf. Sect. 7.3.2) [33].
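The three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a specific PDA from the source: centre clipping stands in for the pre-processing, a simple autocorrelation peak search for the extraction, and a median filter is swapped in for the Viterbi smoothing named above. All function names and parameter defaults are hypothetical.

```python
import math
from statistics import median


def preprocess(frame, clip_ratio=0.3):
    """Assumed pre-processing: remove the mean, then centre-clip the frame."""
    mean = sum(frame) / len(frame)
    centred = [x - mean for x in frame]
    threshold = clip_ratio * max(abs(x) for x in centred)
    clipped = []
    for x in centred:
        if x > threshold:
            clipped.append(x - threshold)
        elif x < -threshold:
            clipped.append(x + threshold)
        else:
            clipped.append(0.0)
    return clipped


def extract_f0(frame, sample_rate_hz, f0_min_hz=50.0, f0_max_hz=800.0):
    """Assumed extraction: largest autocorrelation value in the admissible lag range.

    Returns 0.0 if no positive autocorrelation value is found.
    """
    lag_min = int(sample_rate_hz / f0_max_hz)
    lag_max = min(int(sample_rate_hz / f0_min_hz), len(frame) - 1)
    best_lag, best_val = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        val = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return sample_rate_hz / best_lag if best_lag else 0.0


def postprocess(track, width=3):
    """Assumed post-processing: median smoothing (a stand-in for Viterbi smoothing)."""
    half = width // 2
    return [median(track[max(0, i - half): i + half + 1]) for i in range(len(track))]


# Synthetic test frame: a 200 Hz sine sampled at 8 kHz (50 ms).
sample_rate_hz = 8000
frame = [math.sin(2 * math.pi * 200 * n / sample_rate_hz) for n in range(400)]
estimate = extract_f0(preprocess(frame), sample_rate_hz)

# A made-up pitch track with one octave error, smoothed by the median filter.
smoothed = postprocess([200.0, 200.0, 400.0, 200.0, 200.0])
print(estimate, smoothed)
```

On real speech, each of the three stages would need considerably more care than this sketch suggests; it is meant only to show how the stages connect.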
Independent of these steps, PDAs can be divided into two families [7]: First are
those operating in the short-time domain, i.e., windowing has taken place and a
 