Digital Signal Processing Reference
6.2.1.8 Formants
The term 'formants' refers to the resonance frequencies of the human vocal tract. In
particular, the lower resonance frequencies of the vocal tract, i.e., formants F1 and
F2, are highly correlated with the phonetic content and allow vowels and diphthongs
(specific concatenations of two vowels) to be mapped in the F1-F2 plane. In several
languages such as Dutch, F3 also plays an important role for the spoken content,
whereas the higher formants are usually shaped more by speaker characteristics [2].
For vowels and non-nasal consonants the transfer function of the vocal tract H(z)
can be approximated as an all-pole transfer function (cf. Sect. 6.2.1.5). This
corresponds to a purely recursive digital filter as realised by linear prediction.
The poles of H(z) are referred to as the formants of the speech signal. When determining the
formants, one usually aims at, in order of relevance, the centre frequency, the
bandwidth, and the amplitude. Formants are mostly assessed by linear prediction analysis.
Alternative methods based on short-time spectra also exist. With these, the formants
can be identified as dominant maxima, e.g., in the spectral envelope, or even directly
from the speech signal [24]. There are, however, a number of difficulties when using
a spectral representation as the starting point for formant determination. Most
prominently, single spikes may exist that exceed the vocal tract's resonance frequencies
in amplitude, caused, e.g., by the fundamental frequency or by noise. In addition, the
resonance frequencies or formants can be so close to each other that they merge into
a single maximum of the spectral envelope. These fundamental problems can be eased
by LPC analysis.
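As a rough illustration of the LPC analysis mentioned above, the following Python sketch
estimates the coefficients of an all-pole model with the autocorrelation method and the
Levinson-Durbin recursion. This is one common way to carry out linear prediction and not
necessarily the variant described in Sect. 6.2.1.5. The function name lpc_autocorrelation,
the sign convention A(z) = 1 + a1 z^-1 + ... + ap z^-p with H_LP(z) = 1/A(z), and the
assumption that the frame has already been windowed are choices made here for illustration.

```python
import numpy as np


def lpc_autocorrelation(frame, order):
    """Estimate LPC coefficients of one (already windowed) speech frame.

    Returns (a, err) where a = [1, a1, ..., ap] are the coefficients of
    A(z) = 1 + a1 z^-1 + ... + ap z^-p (assumed convention, H_LP(z) = 1/A(z))
    and err is the residual prediction error energy.
    """
    x = np.asarray(frame, dtype=float)
    n = len(x)
    # Autocorrelation for lags 0..order.
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)])

    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    if err == 0.0:
        # Silent frame: nothing to predict.
        return a, 0.0

    # Levinson-Durbin recursion: raise the model order one step at a time.
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err
```

The returned coefficient vector defines the polynomial whose roots are the poles of
the prediction filter.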
Let us now consider formant analysis by linear prediction [2, 25]. The purely
recursive filter of the linear prediction fits the smooth envelope of the short-time
spectra. Spectral maxima are modelled well, whereas areas of low spectral energy are not.
In the linear model, speech production is modelled by the chain of generation (cf.
Sect. 6.2.1.4) starting with the excitation E(z) (periodic or noise), excitation
spectrum G(z), vocal tract H(z) and radiation R(z) [1, 19]. However, we model the poles
of the spectral function S(z) of the speech signal. This means that we do not know
from which of the components G(z), H(z), or R(z) the poles found in the transfer
function H_LP(z) of the prediction filter originate. H_LP(z) can thus not directly
be assumed to be the optimal approximation of H(z). Rather, one has to determine
which of the poles of H_LP(z) belong to formants [26]. The poles of H_LP(z) can
first be determined by suitable algorithms such as the Newton-Raphson method:
the algorithm is initialised with an estimate for the first pole and then calculates the
polynomial value and its derivative. Then, in an iterative manner, improved estimates
are calculated. The iteration terminates once the difference between successive
estimates is smaller than a set threshold. The polynomial can then be divided by the
factor belonging to this pole, and the algorithm starts over with the reduced
polynomial until all poles are determined. Re-iterating each pole on the overall
polynomial helps to ease the limited precision of the first round. One can speed up
this process by using the poles from the last speech frame for initialisation, as the
vocal tract position, and with it the poles, changes comparatively slowly over time.
Finally, a frequency range validation criterion could be applied to discard poles
which do not belong to formants.
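The pole search described above can be sketched in Python as follows. This is a minimal
illustration under stated assumptions, not the procedure of [26]: the coefficient list
a = [1, a1, ..., ap] is assumed to describe A(z) with H_LP(z) = 1/A(z); the function
names, the default starting point 0.4 + 0.9j, the optional init list of poles from the
previous frame, and the thresholds f_min and bw_max are all hypothetical. The bandwidth
limit goes beyond the frequency range criterion mentioned in the text. Centre frequency
and bandwidth follow from a pole z via the usual relations F = arg(z) fs / (2 pi) and
B = -ln|z| fs / pi.

```python
import cmath
import math


def poly_eval(coeffs, z):
    """Horner scheme: value and derivative of a polynomial (highest power first)."""
    p = complex(coeffs[0])
    dp = 0j
    for c in coeffs[1:]:
        dp = dp * z + p
        p = p * z + c
    return p, dp


def newton(coeffs, z0, tol=1e-12, max_iter=200):
    """Newton-Raphson iteration; stops when successive estimates differ by < tol."""
    z = complex(z0)
    for _ in range(max_iter):
        p, dp = poly_eval(coeffs, z)
        if dp == 0:
            z += 1e-3 * (1 + 1j)    # nudge away from a stationary point
            continue
        step = p / dp
        z -= step
        if abs(step) < tol:
            break
    return z


def deflate(coeffs, root):
    """Divide the polynomial by (z - root) via synthetic division."""
    out = [complex(coeffs[0])]
    for c in coeffs[1:-1]:
        out.append(c + out[-1] * root)
    return out


def find_poles(a, init=None):
    """Find all roots of z^p + a1 z^(p-1) + ... + ap, i.e. the poles of 1/A(z).

    init (hypothetical): poles of the previous frame used as starting points.
    """
    full = [complex(c) for c in a]
    work = list(full)
    poles = []
    for i in range(len(a) - 1):
        z0 = init[i] if init is not None else 0.4 + 0.9j
        z = newton(work, z0)        # root of the reduced polynomial
        z = newton(full, z)         # re-iterate on the overall polynomial
        poles.append(z)
        work = deflate(work, z)
    return poles


def formants(poles, fs, f_min=90.0, bw_max=400.0):
    """Convert poles to (centre frequency, bandwidth) in Hz and validate them.

    f_min and bw_max are illustrative thresholds, not values from the text.
    """
    result = []
    for z in poles:
        if z.imag <= 0:             # one pole per conjugate pair, no real poles
            continue
        freq = cmath.phase(z) * fs / (2.0 * math.pi)
        bw = -math.log(abs(z)) * fs / math.pi
        if f_min < freq < fs / 2.0 - f_min and bw < bw_max:
            result.append((freq, bw))
    return sorted(result)
```

For a frame sampled at fs, formants(find_poles(a), fs) returns the surviving candidates
in ascending order of frequency; passing the poles of the preceding frame as init
corresponds to the speed-up described above.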
 