Digital Signal Processing Reference
Another way to determine formants is to smooth short-time spectra obtained by
DFT or FFT [2]. The idea is to obtain a smooth spectrum, just as in the case of linear
prediction, that is freed from the waviness caused by the fundamental frequency.
This waviness shows up as maxima spaced F0 apart, caused by the harmonics of
the fundamental frequency. These maxima can easily be confused with
formants if the spectrum is not smoothed. The maxima can be smeared
by convolution with a smoothing function; however, this method is usually not very
precise [2].
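
As a rough illustration of smoothing by convolution, the following sketch builds a toy magnitude spectrum (a two-bump envelope multiplied by a harmonic comb) and convolves it with a Hann-shaped kernel about two harmonic spacings wide. The bin width, the F0 value, the envelope shape, the kernel choice and all function names are assumptions made for this example only; they are not taken from [2].

```python
import math

BIN_HZ = 10.0   # assumed frequency resolution of the spectrum
F0_HZ = 100.0   # assumed fundamental frequency


def synthetic_spectrum(n_bins=301):
    """Toy magnitude spectrum: two formant-like bumps times a harmonic comb."""
    mag = []
    for b in range(n_bins):
        f = b * BIN_HZ
        envelope = (math.exp(-((f - 500.0) / 150.0) ** 2)
                    + 0.6 * math.exp(-((f - 1500.0) / 200.0) ** 2))
        comb = 0.5 * (1.0 + math.cos(2.0 * math.pi * f / F0_HZ))
        mag.append(envelope * comb)
    return mag


def hann_kernel(width):
    """Normalised Hann-shaped smoothing kernel with `width` non-zero taps."""
    taps = [0.5 - 0.5 * math.cos(2.0 * math.pi * (k + 1) / (width + 1))
            for k in range(width)]
    total = sum(taps)
    return [t / total for t in taps]


def smooth_spectrum(mag, width):
    """Convolve the spectrum with the kernel; border values are repeated."""
    kernel = hann_kernel(width)
    half = width // 2
    last = len(mag) - 1
    out = []
    for i in range(len(mag)):
        acc = 0.0
        for j, w in enumerate(kernel):
            acc += w * mag[min(max(i + j - half, 0), last)]
        out.append(acc)
    return out


def local_maxima(x):
    """Indices of interior samples above the left and not below the right neighbour."""
    return [i for i in range(1, len(x) - 1) if x[i - 1] < x[i] >= x[i + 1]]


raw = synthetic_spectrum()
width = 2 * int(F0_HZ / BIN_HZ) - 1      # 19 taps, spanning about 2 * F0
smoothed = smooth_spectrum(raw, width)
strongest_hz = max(range(len(smoothed)), key=smoothed.__getitem__) * BIN_HZ
print(len(local_maxima(raw)), "maxima before smoothing")
print(len(local_maxima(smoothed)), "maxima after smoothing")
print("strongest smoothed peak near", strongest_hz, "Hz")
```

A wider kernel suppresses the harmonic ripple more strongly but also blurs neighbouring formants into each other, which is one reason the method tends to be imprecise.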
If the analysis is based on the shape of the spectrum, peak-picking from a list
of extreme values is needed to decide on the 'right' maxima. This holds for both
smoothed spectra and linear prediction spectra. Usually, candidates are first found per
speech frame; then their evolution over time is taken into account by also looking at
neighbouring frames.
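
The two-stage idea (per-frame candidates, then continuity over time) can be sketched with a small dynamic-programming search. This is only one possible way to use neighbouring frames, not necessarily the procedure the cited literature has in mind. The function name `track_formant`, the cost (negative peak amplitude plus a penalty on frequency jumps) and the weight `jump_weight` are hypothetical choices for illustration.

```python
def track_formant(frames, jump_weight=0.01):
    """Pick one candidate per frame so that strong peaks and small jumps are preferred.

    frames: list of non-empty lists of (freq_hz, amplitude) candidates, one list per frame.
    Returns the chosen frequency for every frame.
    """
    # Accumulated cost of the best path ending in each candidate of the first frame.
    prev_cost = [-amp for _, amp in frames[0]]
    back = []  # back[t-1][k] = index of the best predecessor of candidate k in frame t
    for t in range(1, len(frames)):
        cur_cost, cur_back = [], []
        for freq, amp in frames[t]:
            # Cost of reaching this candidate from every candidate of the previous frame.
            costs = [prev_cost[j] + jump_weight * abs(freq - prev_freq)
                     for j, (prev_freq, _) in enumerate(frames[t - 1])]
            best_j = min(range(len(costs)), key=costs.__getitem__)
            cur_cost.append(costs[best_j] - amp)
            cur_back.append(best_j)
        back.append(cur_back)
        prev_cost = cur_cost
    # Trace the cheapest path backwards from the last frame.
    k = min(range(len(prev_cost)), key=prev_cost.__getitem__)
    path = [k]
    for cur_back in reversed(back):
        k = cur_back[k]
        path.append(k)
    path.reverse()
    return [frames[t][k][0] for t, k in enumerate(path)]


# Hypothetical candidates; frame 3 contains a strong but isolated peak at 1100 Hz.
frames = [
    [(500, 1.0), (1500, 0.8)],
    [(520, 0.9), (1480, 0.8)],
    [(540, 0.7), (1100, 1.0)],
    [(560, 0.9), (1500, 0.8)],
]
print(track_formant(frames))  # [500, 520, 540, 560]
```

Picking the strongest candidate in each frame independently would jump to 1100 Hz in the third frame; the continuity penalty keeps the track on the slowly rising candidate instead.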
Overall, formant tracking has not been solved to full satisfaction to the present day [27].
Among the main difficulties are unfavourable signal conditions, in particular
insufficient spectral resolution in the case of neighbouring formants of similar
amplitude. Further, formants are, strictly speaking, only defined for vowels radiated
via the mouth. The shunt of the nasal cavity changes the frequency response
of the vocal tract significantly, as novel nasal formants are added and formants may
be compensated by anti-formants, i.e., zeros in the transfer function [6]. Such
compensation may also occur due to zeros in the excitation spectrum G(z). In addition,
depending on the speaker and the phonemes (in particular dark vowels), the upper
formants from F3 upwards are too weak in comparison to the surrounding noise. Finally,
there exists no ground truth, only gold standards, when algorithms are tested on
spontaneous human speech. There are, however, some data sets, such as the MSR-UCLA VTR
database, a partition of the TIMIT corpus, that are manually labelled by expert
phoneticians [28]. Another standard approach to validation is to use
synthesised speech, for which the formant positions are known [29]. This is, however, less
realistic than evaluating performance on real human speech. The same
lack of ground truth also affects fundamental frequency detection
algorithms.
Due to these problems, formant tracking is still an active field of research, and
new approaches continue to be introduced, such as biologically inspired algorithms based
on gammatone filter banks [27].
The tracking of anti-formants, on the other hand, is hardly pursued. As these are
not considered further in this topic, we only refer to a few methods that aim to
describe formants and anti-formants jointly. The first is the autoregressive moving
average (ARMA) method [30]: the autoregressive part models the recursive part
of the filter to be determined, i.e., the poles, whereas the moving average part models
the non-recursive part, i.e., the zeros. A more common method, however, is to use the
reciprocal or logarithmic transfer function and to apply the same methods as for the
poles [31].
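
For orientation, the pole-zero view behind both approaches can be written out as follows. The symbols H(z), A(z), B(z), the orders p and q, and the sign convention in A(z) are assumptions of this sketch and may differ from the notation in [30] and [31].

```latex
% ARMA model: zeros come from the numerator B(z), poles from the denominator A(z)
H(z) = \frac{B(z)}{A(z)}
     = \frac{\sum_{k=0}^{q} b_k z^{-k}}{1 + \sum_{k=1}^{p} a_k z^{-k}}

% Reciprocal transfer function: the zeros of H(z) become poles
\frac{1}{H(z)} = \frac{A(z)}{B(z)}

% Logarithmic magnitude: a sign flip turns spectral dips into peaks
-\log \lvert H(z) \rvert = \log \lvert A(z) \rvert - \log \lvert B(z) \rvert
```

In both rewritten forms the anti-formants appear as maxima, so a pole-oriented method can be applied to them unchanged.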