Audio Features - Intelligent Audio Analysis

Digital Signal Processing Reference

Another option for determining formants is to smooth short-time spectra based on

DFT or FFT [ 2 ]. The idea is to obtain a smooth spectrum just as in the case of linear

prediction, which is freed from the waviness caused by the fundamental frequency.

This waviness results in maxima spaced F0 apart due to the harmonics of

the fundamental frequency. Obviously, these maxima can easily be confused with

formants if the spectrum is not smoothed. Smearing of the maxima can be obtained

by convolution with a smoothing function—however, this method is usually not very

precise [ 2 ].
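
As a rough illustration of this smoothing idea, the following Python sketch builds a synthetic voiced frame and convolves its magnitude spectrum with a triangular kernel. Everything specific here is an assumption made for the example and not taken from [ 2 ]: the sampling rate, frame length, F0, the single resonance-like envelope at 1000 Hz, and the choice of a triangular kernel spanning two harmonic spacings.

```python
import cmath
import math

FS = 8000    # sampling rate in Hz (assumed for this example)
N = 512      # frame length, so one DFT bin is FS / N = 15.625 Hz
F0 = 125.0   # fundamental frequency: harmonics fall on every 8th bin


def envelope(freq_hz):
    """Hypothetical spectral envelope with a single 'formant' at 1000 Hz."""
    return 1.0 / math.sqrt(1.0 + ((freq_hz - 1000.0) / 200.0) ** 2)


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2 (naive O(N^2) DFT, kept simple on purpose)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def smooth(spectrum, half_width):
    """Convolve with a triangular kernel spanning +/- half_width bins (edges clamped)."""
    offsets = range(-half_width + 1, half_width)
    weights = [half_width - abs(j) for j in offsets]
    norm = sum(weights)
    last = len(spectrum) - 1
    return [
        sum(w * spectrum[min(max(k + j, 0), last)] for j, w in zip(offsets, weights)) / norm
        for k in range(len(spectrum))
    ]


def local_maxima(values):
    """Indices of interior points that exceed their left neighbour and are not below the right one."""
    return [k for k in range(1, len(values) - 1) if values[k - 1] < values[k] >= values[k + 1]]


# Voiced frame: harmonics of F0 weighted by the envelope.
harmonics = range(1, 32)
frame = [
    sum(envelope(h * F0) * math.sin(2.0 * math.pi * h * F0 * i / FS) for h in harmonics)
    for i in range(N)
]

raw = magnitude_spectrum(frame)
raw_peaks = local_maxima(raw)             # contains every harmonic of F0
smoothed = smooth(raw, half_width=8)      # kernel reaches one harmonic spacing each way
smoothed_peaks = local_maxima(smoothed)   # only the envelope maximum survives
formant_hz = [k * FS / N for k in smoothed_peaks]
```

In this idealised setting the harmonics sit exactly on DFT bins, so the kernel simply interpolates between them. With real speech the harmonics drift and the kernel width has to be chosen against an unknown F0, which is one reason the method tends to be imprecise.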

If analysis is based on the spectral appearance, peak-picking starting from a list

of extreme values is needed to decide on the 'right' maxima. This holds for both

smoothed and linear prediction spectra. Usually, candidates are first found per speech

frame; then the evolution over time is taken into account by also looking at

neighbouring frames.
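
The two-stage procedure (per-frame candidates, then a look at neighbouring frames) can be sketched as follows. This is a deliberately simple greedy heuristic chosen for illustration, not a method from the cited literature; the function name `track_formant`, the `max_jump_hz` threshold, and the candidate values are all hypothetical.

```python
def track_formant(candidates_per_frame, start_hz, max_jump_hz=200.0):
    """Follow one formant through per-frame peak candidates.

    For each frame, pick the candidate closest to the previous estimate,
    provided it lies within max_jump_hz; otherwise keep the previous value.
    """
    track = []
    previous = start_hz
    for candidates in candidates_per_frame:
        nearby = [c for c in candidates if abs(c - previous) <= max_jump_hz]
        if nearby:
            previous = min(nearby, key=lambda c: abs(c - previous))
        track.append(previous)
    return track


# Hypothetical spectral maxima (Hz) found independently in four frames.
# In the third frame the F1 peak is missing and a spurious peak appears at 1100 Hz.
frames = [
    [480.0, 1500.0, 2500.0],
    [510.0, 1480.0, 2550.0],
    [1100.0, 1450.0, 2500.0],
    [540.0, 1400.0, 2480.0],
]

f1_track = track_formant(frames, start_hz=500.0)
f2_track = track_formant(frames, start_hz=1500.0)
```

Here the spurious 1100 Hz candidate is rejected for both tracks because it lies too far from the previous estimates. Practical trackers usually replace this greedy rule with a global optimisation over the whole utterance.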

Overall, formant tracking has not been solved to full satisfaction to date [ 27 ].

Among the main difficulties one can name unfavourable signal conditions, in partic-

ular insufficient spectral resolution in the case of neighbouring formants of similar

amplitude. Further, formants are—strictly speaking—only defined for vowels radi-

ated via the mouth. The shunt of the nasal cavity changes the frequency response

of the vocal tract significantly, as novel nasal formants are added and formants may

be compensated by anti-formants, i.e., zeros in the transfer function [ 6 ]. Such

compensation may also occur due to zeros in the excitation spectrum G(z). In addition,

depending on the speaker and the phonemes (in particular dark vowels), the upper

formants from F3 upwards are too weak in comparison to surrounding noise. And finally, there

exists no ground truth, only gold standards, if algorithms are tested with spontaneous

human speech. There are, however, some sets, such as a partition of the TIMIT corpus

(the MSR-UCLA VTR database), that are manually labelled by expert phoneticians [ 28 ].

Another standard approach to validity measurement is the use of

synthesised speech, where formant positions are known [ 29 ]. Obviously, this is less

realistic than comparing performance on real human speech. In a similar way, this

last problem of lacking ground truth also holds for fundamental frequency detection

algorithms.
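
To make the synthetic-speech option concrete, here is a minimal sketch of a test signal whose formant positions are known by construction: an impulse train passed through a cascade of two-pole resonators. The formant frequencies, bandwidths, F0, and sampling rate are arbitrary values assumed for the example, and the helper names are hypothetical; this is not the synthesis procedure of [ 29 ].

```python
import cmath
import math


def resonator(signal, freq_hz, bandwidth_hz, fs):
    """Two-pole resonator: y[n] = x[n] + 2 r cos(theta) y[n-1] - r^2 y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / fs)
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / fs)
    a2 = -r * r
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out


def synthesise_vowel(formants, f0_hz, fs, n_samples):
    """Impulse train at f0_hz filtered by one resonator per (frequency, bandwidth) pair."""
    period = round(fs / f0_hz)
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, fs)
    return signal


def bin_magnitude(frame, k):
    """Magnitude of a single DFT bin k of the frame."""
    n = len(frame)
    return abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))


FS = 8000                                         # sampling rate in Hz (assumed)
KNOWN_FORMANTS = [(500.0, 80.0), (1500.0, 80.0)]  # (frequency, bandwidth) in Hz (assumed)

speech = synthesise_vowel(KNOWN_FORMANTS, f0_hz=125.0, fs=FS, n_samples=2048)
frame = speech[-512:]                             # steady-state part, 8 pitch periods

# With F0 = 125 Hz the harmonics fall on every 8th bin of a 512-point DFT.
harmonic_bins = range(8, 256, 8)
strongest_bin = max(harmonic_bins, key=lambda k: bin_magnitude(frame, k))
strongest_hz = strongest_bin * FS / 512
```

A formant tracker under test would be run on `speech` and its output compared against `KNOWN_FORMANTS`; the last lines only confirm that the strongest harmonic coincides with the first known formant.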

Due to these problems, formant tracking is still an active field of research, and

new approaches are still being introduced, such as biologically inspired algorithms based

on gammatone filter banks [ 27 ].

The tracking of anti-formants on the other hand is hardly pursued. As these are

also not further considered in this topic, we only refer to a few methods that aim to

jointly describe formants and anti-formants. The first is the autoregressive moving

average (ARMA) method [ 30 ]: the autoregressive part deals with the recursive part

of the filter to be determined, i.e., the poles, whereas the moving average part handles

the non-recursive part, i.e., the zeros. A more common method, however, is to use the

reciprocal or logarithmic transfer function and to apply the same methods as for the

poles [ 31 ].
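
The reciprocal idea can be illustrated with a small pole-zero model: inverting the transfer function swaps poles and zeros, so a dip caused by an anti-formant becomes a peak that ordinary peak-picking can find. The sketch below does not estimate an ARMA model; it only evaluates a hand-built transfer function, and the chosen frequencies, bandwidths, and helper names are assumptions made for the example.

```python
import cmath
import math


def second_order(freq_hz, bandwidth_hz, fs):
    """Coefficients of 1 + c1 z^-1 + c2 z^-2 with a conjugate root pair at freq_hz."""
    r = math.exp(-math.pi * bandwidth_hz / fs)
    theta = 2.0 * math.pi * freq_hz / fs
    return [1.0, -2.0 * r * math.cos(theta), r * r]


def log_magnitude(numerator, denominator, freq_hz, fs):
    """20 log10 |B(z) / A(z)| evaluated on the unit circle at freq_hz."""
    z_inv = cmath.exp(-2j * math.pi * freq_hz / fs)
    num = sum(c * z_inv ** i for i, c in enumerate(numerator))
    den = sum(c * z_inv ** i for i, c in enumerate(denominator))
    return 20.0 * math.log10(abs(num / den))


FS = 8000
zeros = second_order(1000.0, 100.0, FS)   # non-recursive (MA) part: anti-formant
poles = second_order(2000.0, 100.0, FS)   # recursive (AR) part: formant

freqs = [25.0 * k for k in range(1, 160)]                        # 25 Hz grid below Nyquist
h = [log_magnitude(zeros, poles, f, FS) for f in freqs]          # H(z)
h_inv = [log_magnitude(poles, zeros, f, FS) for f in freqs]      # 1 / H(z)

formant_hz = freqs[h.index(max(h))]               # peak of H: the formant
anti_formant_hz = freqs[h_inv.index(max(h_inv))]  # peak of 1/H: the anti-formant
```

On the logarithmic scale the reciprocal is simply the negated spectrum, so the same maximum search serves for both poles and zeros.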
