Audio Features - Intelligent Audio Analysis



$$S(f) = E(f) \cdot G(f) \cdot H(f) \cdot R(f) \tag{6.26}$$

In the logarithmic domain, this product turns into a summation. The signal part that is owed to E(f) can be eliminated by high- or band-pass filtering. In the case of high-pass filtering, this requires that these parts are indeed of low frequency, in order not to cut away formants (cf. Sect. 6.2.1.8). This high-pass is best realised on the back-transformation of the logarithmised powers of the spectrum into the time domain. This leads to the so-called cepstrum, with the independent variable d, the 'quefrency' [7]. These names were created artificially from the terms 'spectrum' and 'frequency' by re-ordering characters. The variable d has the dimension of time and corresponds to the delay in the autocorrelation function (ACF), which is the reason for the choice of the same identifier. By applying the logarithm to the power spectrum, the product relationship of the source signal and the transfer functions turns into a sum relationship. After the back-transformation to the time domain (i.e., in the cepstrum), the additive concatenation of the components of the linear source-filter model remains [2]:

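$$
\begin{align}
s(d) &= \mathrm{IDFT}\left[\log |S(f)|\right] \tag{6.27} \\
     &= \mathrm{IDFT}\left[\log\left(|E(f)| \cdot |G(f)| \cdot |H(f)| \cdot |R(f)|\right)\right] \tag{6.28} \\
     &= e(d) + g(d) + h(d) + r(d) \tag{6.29}
\end{align}
$$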

where (I)DFT is the (Inverse) Discrete Fourier Transformation, and e(d), g(d), h(d) and r(d) are the equivalents of their capitalised frequency domain counterparts E(f), etc. The cepstrum is real-valued if computed from the amplitude or power spectrum, as these are both axis-symmetric [6]. The desired high-pass can be obtained by trimming the cepstrum after the first fundamental period, i.e., at T_0.
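As a minimal sketch of the relationship s(d) = IDFT[log |S(f)|] and its additive decomposition, the following Python snippet builds two toy magnitude spectra, multiplies them, and checks that the cepstrum of the product equals the sum of the individual cepstra and is real-valued. The spectra E and H, the frame length N = 64 and the ripple period of 8 samples (which merely stands in for T_0) are assumptions made for illustration only. The naive DFT is used just to keep the example free of dependencies.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) (inverse) DFT; adequate for a small illustration."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def cepstrum_from_magnitude(mag):
    """IDFT of the log-magnitude spectrum."""
    return dft([math.log(m) for m in mag], inverse=True)


N = 64
# Hypothetical, strictly positive magnitude spectra that are even in the
# bin index k (value at k equals value at N - k).
E = [1.5 + math.cos(2 * math.pi * 8 * k / N) for k in range(N)]  # fast ripple
H = [2.0 + math.cos(2 * math.pi * 1 * k / N) for k in range(N)]  # slow envelope
S = [e * h for e, h in zip(E, H)]  # product in the frequency domain

c_e = cepstrum_from_magnitude(E)
c_h = cepstrum_from_magnitude(H)
c_s = cepstrum_from_magnitude(S)

# Additivity: cepstrum of the product versus sum of the cepstra.
max_additivity_error = max(abs(c_s[d] - (c_e[d] + c_h[d])) for d in range(N))

# Real-valuedness: imaginary parts vanish for an even magnitude spectrum.
max_imag = max(abs(c.imag) for c in c_s)

# Below the ripple period (quefrencies 1..7) only the slow envelope contributes.
low_quefrency_error = max(abs(c_s[d] - c_h[d]) for d in range(1, 8))

print(max_additivity_error, max_imag, low_quefrency_error)
```

In this toy setting all three printed values are at the level of floating-point rounding. The coefficients at quefrencies 1 to 7, i.e., before the first period of the fast ripple, stem from the slow envelope alone, which is what makes a cut at that period meaningful.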

Variations of the classical cepstrum use other back-transformations such as the Discrete Cosine Transformation (DCT) or Principal Component Analysis (PCA) for de-correlation.

If one maps the power spectrum onto Mel-frequency scale bands, then takes the logarithm of the power in each band, and applies a DCT to the resulting values, one obtains the Mel-frequency cepstral coefficients (MFCCs). The mapping onto Mel-frequency scale bands is typically performed by triangular filters that are equidistantly spaced on the Mel-frequency scale. This scale takes the physiology of human hearing into account: the frequency resolution of the human ear is higher for low frequencies and lower for high frequencies; an approximately logarithmic relationship between the frequency resolution and the absolute frequency exists [5]. The Mel-frequency scale Mel(f) for a frequency f (in Hz) is given by:

$$\mathrm{Mel}(f) = 2595 \cdot \log_{10}\left(1 + \frac{f}{700}\right) \tag{6.30}$$
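The Mel mapping Mel(f) = 2595 log10(1 + f/700) and its inverse can be sketched in a few lines of Python. The function names and the example band layout (10 bands between 0 and 8000 Hz) are hypothetical choices for illustration, and the logarithm is taken to base 10, with f in Hz.

```python
import math


def hz_to_mel(f_hz):
    """Mel value of a frequency in Hz (base-10 logarithm)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse mapping, obtained by solving the formula above for f."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_spaced_centres(f_min_hz, f_max_hz, n_bands):
    """Centre frequencies (Hz) of n_bands filters, equidistant in Mel."""
    lo, hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(1, n_bands + 1)]


# Hypothetical layout: 10 bands between 0 Hz and 8 kHz.
centres = mel_spaced_centres(0.0, 8000.0, 10)
spacings = [b - a for a, b in zip(centres, centres[1:])]

print(hz_to_mel(1000.0))  # close to 1000 Mel
print([round(c) for c in centres])
```

Because the centres are equidistant in Mel, their spacing in Hz grows with frequency, which reflects the coarser frequency resolution of the ear at high frequencies.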

MFCCs are among the most popular audio features. Usually, coefficients 0 to 16 are used. For speech recognition in particular, coefficients 0-12 are applied most frequently.
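The MFCC steps described above (power spectrum, triangular Mel filters, logarithm, DCT) can be sketched for a single frame as follows. This is a dependency-free illustration under stated assumptions, not a reference implementation: the number of bands (26), the power floor, the unnormalised DCT-II, the naive DFT and the test tone are all hypothetical choices, and practical toolkits differ in filter shapes and scaling. Only the 13 output coefficients (0-12) follow the text.

```python
import cmath
import math


def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """One-sided power spectrum (bins 0 .. N/2) via a naive DFT."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        x_k = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spec.append(abs(x_k) ** 2)
    return spec


def triangular_filterbank(n_bands, n_bins, sample_rate):
    """Triangular weights whose corner frequencies are equidistant in Mel."""
    nyquist = sample_rate / 2.0
    mel_max = hz_to_mel(nyquist)
    # n_bands + 2 corner points: left edge, the centres, right edge.
    corners = [mel_to_hz(mel_max * i / (n_bands + 1)) for i in range(n_bands + 2)]
    bin_hz = [nyquist * b / (n_bins - 1) for b in range(n_bins)]
    bank = []
    for m in range(1, n_bands + 1):
        left, centre, right = corners[m - 1], corners[m], corners[m + 1]
        weights = []
        for f in bin_hz:
            if left < f <= centre:
                weights.append((f - left) / (centre - left))  # rising slope
            elif centre < f < right:
                weights.append((right - f) / (right - centre))  # falling slope
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank


def dct_ii(values, n_coeffs):
    """Unnormalised DCT-II of `values`, truncated to n_coeffs outputs."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(n_coeffs)
    ]


def mfcc(frame, sample_rate, n_bands=26, n_coeffs=13, floor=1e-10):
    spec = power_spectrum(frame)
    bank = triangular_filterbank(n_bands, len(spec), sample_rate)
    band_powers = [sum(w * p for w, p in zip(weights, spec)) for weights in bank]
    # The floor avoids log(0) for bands that receive no energy.
    log_powers = [math.log(max(p, floor)) for p in band_powers]
    return dct_ii(log_powers, n_coeffs)


# Hypothetical test frame: two sinusoids sampled at 8 kHz.
sample_rate = 8000
frame = [
    math.sin(2 * math.pi * 440 * t / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 1320 * t / sample_rate)
    for t in range(256)
]
coeffs = mfcc(frame, sample_rate)
print(len(coeffs))  # 13 values, i.e. coefficients 0-12
```

With this unnormalised DCT-II, coefficient 0 is simply the sum of the log band powers and thus tracks the overall frame energy, while the higher coefficients describe the shape of the Mel-band envelope.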
