Audio Features - Intelligent Audio Analysis


6.2.1.8 Formants

The term 'formants' refers to the resonance frequencies of the human vocal tract. In particular, the lower resonance frequencies of the vocal tract, i.e., formants F1 and F2, are highly correlated with the phonetic content and allow vowels and diphthongs (specific concatenations of two vowels) to be mapped in the F1, F2 plane. In several languages, such as Dutch, F3 also plays an important role for the spoken content, whereas the higher formants are usually shaped more by speaker characteristics [2].

For vowels and non-nasal consonants, the transfer function of the vocal tract H(z) can be approximated as an all-pole transfer function (cf. Sect. 6.2.1.5). This corresponds to a purely recursive digital filter as realised by linear prediction. The poles of H(z) are referred to as the formants of the speech signal. When determining the formants, one usually aims at, in order of relevance, the centre frequency, the bandwidth, and the amplitude. Formants are mostly assessed by linear prediction analysis. Alternative methods based on short-time spectra also exist. There, the formants can be identified as dominant maxima, e.g., in the spectral envelope, or even directly from the speech signal [24]. There are, however, a number of difficulties when using a spectral representation as the starting point for formant determination. Most prominently, single spikes may exist that exceed the vocal tract's resonances in amplitude, e.g., caused by the fundamental frequency or by noise. Furthermore, resonance frequencies or formants can lie so close to each other that they merge into a single maximum of the spectral envelope. These fundamental problems can be eased by LPC analysis.
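As a brief aside on how a pole relates to the first two of these quantities: under the common assumption of a complex-conjugate pole pair inside the unit circle, written here as z_k = r_k e^{j theta_k}, and a sampling rate f_s (all symbols are introduced for this illustration and are not taken from the text), the centre frequency F_k and the bandwidth B_k are usually approximated as follows. The bandwidth expression is an approximation that holds best for poles close to the unit circle.

```latex
z_k = r_k \, e^{\mathrm{j}\theta_k}, \qquad
F_k = \frac{f_s}{2\pi}\,\theta_k, \qquad
B_k \approx -\frac{f_s}{\pi}\,\ln r_k
```

For example, with f_s = 8000 Hz, a pole at radius r_k = 0.97 and angle theta_k = 0.55 rad gives a centre frequency of about 700 Hz and a bandwidth of roughly 78 Hz.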

Let us now consider formant analysis by linear prediction [2, 25]. The purely recursive filter of the linear prediction fits the smooth envelope of the short-time spectra. Spectral maxima are modelled well, whereas areas of low spectral energy are not.
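The envelope fit described above can be sketched in a few lines. The following is a minimal illustration, not the book's implementation: it assumes the autocorrelation method for estimating the predictor, the sign convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p with H_LP(z) proportional to 1/A(z), and it omits the gain factor. The function names `lpc_autocorr` and `lp_envelope` are hypothetical.

```python
import numpy as np

def lpc_autocorr(frame, order):
    """Estimate [1, a_1, ..., a_p] of A(z) with the autocorrelation method.

    Assumed convention: x[n] is predicted as -sum_k a_k * x[n-k].
    """
    x = np.asarray(frame, dtype=float)
    n = len(x)
    # Autocorrelation for lags 0..order
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)])
    # Toeplitz normal equations: R a = -r[1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, -r[1:])
    return np.concatenate(([1.0], a))

def lp_envelope(a, fs, n_fft=1024):
    """Evaluate |1 / A(e^{jw})| from 0 Hz to fs/2 (gain factor omitted)."""
    A = np.fft.rfft(a, n_fft)                  # zero-padded FFT of the coefficients
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    return freqs, 1.0 / np.abs(A)
```

Applied to a windowed speech frame, the maxima of the returned envelope mark candidate formant positions, while spectral valleys are represented only coarsely, in line with the remark above.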

In the linear model, speech production is modelled by the chain of generation (cf. Sect. 6.2.1.4), starting with the excitation E(z) (periodic or noise), followed by the excitation spectrum G(z), the vocal tract H(z), and the radiation R(z) [1, 19]. However, we model the poles of the spectral function S(z) of the speech signal. This means that we do not know from which of the components G(z), H(z), or R(z) the poles found in the transfer function H_LP(z) of the prediction filter originate. H_LP(z) can thus not directly be assumed to be the optimal approximation of H(z). Rather, one has to determine which of the poles of H_LP(z) belong to formants [26].

The poles of H_LP(z) can first be determined by suitable algorithms such as the Newton-Raphson method: the algorithm is initialised with an estimate for the first pole and then calculates the polynomial value and its derivative. Then, in an iterative manner, improved estimates are calculated. The iteration terminates once the difference between successive solutions is smaller than a set threshold. The polynomial can then be divided by the linear factor belonging to this pole, and the algorithm starts over for the now reduced polynomial until all poles are determined. A re-iteration per pole with the overall polynomial helps to compensate for limited precision in the first round. One can speed up this process by using the poles from the previous speech frame for initialisation, as the vocal tract position, and with it the poles, change comparatively slowly over time. Finally, a frequency range validation criterion can be applied to discard poles which do not belong to formants.
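The pole search outlined here (Newton-Raphson iteration, deflation, re-iteration on the full polynomial, optional warm start, frequency range check) might look as follows. This is a sketch under stated assumptions rather than the method of [26]: the function names, the default start value, the retry strategy with rotated start points, and the range limits `f_lo` and `f_hi` are all hypothetical choices for illustration.

```python
import numpy as np

def newton_root(coeffs, z0, tol=1e-10, max_iter=100):
    """Newton-Raphson for one root; returns (estimate, converged flag)."""
    deriv = np.polyder(coeffs)
    z = complex(z0)
    for _ in range(max_iter):
        slope = np.polyval(deriv, z)
        if slope == 0:                       # stationary point: give up on this start
            return z, False
        step = np.polyval(coeffs, z) / slope
        z = z - step
        if abs(step) < tol:                  # successive estimates agree
            return z, True
    return z, False

def deflate(coeffs, root):
    """Divide the polynomial by (z - root) via synthetic division."""
    out = np.empty(len(coeffs) - 1, dtype=complex)
    out[0] = coeffs[0]
    for i in range(1, len(coeffs) - 1):
        out[i] = coeffs[i] + root * out[i - 1]
    return out

def all_poles(coeffs, init=None, tol=1e-10):
    """Find all roots of the denominator polynomial, one pole at a time.

    coeffs: [1, a_1, ..., a_p], highest power first.
    init:   optional start values, e.g. the poles of the previous frame.
    """
    full = np.asarray(coeffs, dtype=complex)
    order = len(full) - 1
    if init is None:
        init = [0.5 + 0.5j] * order          # arbitrary complex start (assumption)
    reduced = full.copy()
    poles = []
    for k in range(order):
        for attempt in range(8):             # rotate the start point on failure
            z, ok = newton_root(reduced, init[k] * np.exp(0.7j * attempt), tol)
            if ok:
                break
        else:
            raise RuntimeError("Newton-Raphson did not converge")
        z, _ = newton_root(full, z, tol)     # re-iterate on the full polynomial
        poles.append(z)
        reduced = deflate(reduced, z)
    return np.array(poles)

def formant_frequencies(poles, fs, f_lo=200.0, f_hi=3500.0):
    """Keep one pole per conjugate pair and discard those outside [f_lo, f_hi]."""
    upper = poles[poles.imag > 0]
    freqs = np.angle(upper) * fs / (2 * np.pi)
    return np.sort(freqs[(freqs >= f_lo) & (freqs <= f_hi)])
```

Passing the poles of the previous frame as `init` gives the warm start mentioned above. In practice a library root finder such as `numpy.roots` is a common alternative to a hand-written Newton-Raphson loop.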
