Robust Emotion Recognition using Pitch Synchronous and Sub-syllabic Spectral Features - Robust Emotion Recognition Using Spectral and Prosodic Features


2.3.1 Linear Prediction Cepstral Coefficients (LPCCs)

Cepstral coefficients derived from either linear prediction (LP) analysis or a filter bank approach are widely treated as standard front-end features. Speech systems built on these features have achieved very high accuracy for speech recorded in a clean environment. Spectral features essentially represent phonetic information, as they are derived directly from spectra. Features extracted from spectra using the energy values of linearly arranged filter banks emphasize the contribution of all frequency components of a speech signal equally. In this context, LPCCs are used to capture emotion-specific information manifested through vocal tract features. In this work, 10th-order LP analysis is performed on the speech signal to obtain 13 LPCCs per 20 ms speech frame, using a frame shift of 10 ms. Human emotion recognition depends equally on two factors: the expression of the emotion by the speaker and its perception by the listener. The purpose of using LPCCs is to take the vocal tract characteristics of the speaker into account while performing automatic emotion recognition [6].
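As a rough illustration of the framing described above (20 ms frames with a 10 ms shift), the following sketch splits a signal into overlapping frames. The 16 kHz sampling rate, the function name `frame_signal`, and the dummy signal are assumptions made for the example; the text does not state a sampling rate.

```python
def frame_signal(signal, sample_rate, frame_ms=20, shift_ms=10):
    """Split a 1-D sequence into overlapping frames.

    A trailing partial frame (shorter than frame_ms) is dropped.
    """
    frame_len = int(sample_rate * frame_ms / 1000)  # samples per frame
    shift = int(sample_rate * shift_ms / 1000)      # samples between frame starts
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += shift
    return frames


sample_rate = 16000            # assumed sampling rate, not given in the text
signal = list(range(800))      # 50 ms of dummy samples
frames = frame_signal(signal, sample_rate)

print(len(frames), len(frames[0]))
```

Under these assumptions each frame holds 320 samples and consecutive frames overlap by half; one LPCC vector would then be computed per frame.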

The cepstrum may be obtained using linear prediction analysis of a speech signal. The basic idea behind linear predictive analysis is that the nth speech sample can be estimated by a linear combination of its previous p samples, as shown in the following equation.

$$s(n) \approx a_1 s(n-1) + a_2 s(n-2) + a_3 s(n-3) + \cdots + a_p s(n-p)$$

where $a_1, a_2, a_3, \ldots, a_p$ are assumed to be constant over a speech analysis frame. These are known as predictor coefficients or linear predictive coefficients, and they are used to predict the speech samples. The difference between the actual and predicted speech samples is known as the error. It is given by

$$e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{p} a_k s(n-k)$$

where $e(n)$ is the error in prediction, $s(n)$ is the original speech signal, $\hat{s}(n)$ is the predicted speech signal, and the $a_k$ are the predictor coefficients.
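The prediction and its error can be sketched in a few lines of Python. The function names `predict` and `prediction_error`, the test frequency `w`, and the sinusoidal test signal are illustrative assumptions. A pure sinusoid is convenient here because it satisfies s(n) = 2cos(w)s(n-1) - s(n-2) exactly, so a second-order predictor with those coefficients should leave an error close to zero.

```python
import math


def predict(s, n, a):
    """Return the LP estimate of s[n] from its previous p = len(a) samples.

    a[k - 1] holds the predictor coefficient a_k.
    """
    return sum(a[k - 1] * s[n - k] for k in range(1, len(a) + 1))


def prediction_error(s, a):
    """Return e(n) = s(n) - s_hat(n) for n = p .. len(s) - 1."""
    p = len(a)
    return [s[n] - predict(s, n, a) for n in range(p, len(s))]


w = 0.3  # arbitrary illustrative frequency, radians per sample
s = [math.sin(w * n) for n in range(50)]

# Coefficients that match the sinusoid's own recurrence.
exact = prediction_error(s, [2.0 * math.cos(w), -1.0])
# A naive predictor that simply repeats the previous sample.
rough = prediction_error(s, [1.0, 0.0])

print(max(abs(e) for e in exact))
print(max(abs(e) for e in rough))
```

The first predictor leaves only floating-point round-off, while the naive one leaves a clearly larger error, which is the quantity the coefficient estimation aims to make small.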

To compute a unique set of predictor coefficients, the sum of squared differences between the actual and predicted speech samples is minimized (error minimization), as shown in the equation below

$$E_n = \sum_{m} \left[ s_n(m) - \sum_{k=1}^{p} a_k s_n(m-k) \right]^2$$

where $m$ indexes the samples in an analysis frame. To solve the above equation for the LP coefficients, $E_n$ has to be differentiated with respect to each $a_k$ and the result equated to zero, as shown below
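Setting each derivative to zero yields a set of p linear equations in the predictor coefficients. The sketch below assumes the autocorrelation formulation (samples outside the frame taken as zero), in which these equations read sum over k of a_k R(|i-k|) = R(i) for i = 1, ..., p, and solves them with the Levinson-Durbin recursion. The choice of this method, the helper names `autocorrelation` and `levinson_durbin`, and the toy frame are assumptions made for illustration, not details given in the text.

```python
def autocorrelation(frame, max_lag):
    """Return R(0..max_lag), treating samples outside the frame as zero."""
    n = len(frame)
    return [sum(frame[m] * frame[m + lag] for m in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, p):
    """Solve sum_k a_k * r[|i - k|] = r[i] for i = 1..p.

    Returns (a, err): a[k - 1] is the coefficient a_k and err is the
    minimized squared prediction error.
    """
    a = [0.0] * (p + 1)  # index 0 is unused so that a[k] matches a_k
    err = r[0]
    for i in range(1, p + 1):
        # Reflection coefficient for order i.
        k_i = (r[i] - sum(a[k] * r[i - k] for k in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k_i
        for k in range(1, i):
            new_a[k] = a[k] - k_i * a[i - k]
        a = new_a
        err *= (1.0 - k_i * k_i)
    return a[1:], err


# Toy frame: a decaying exponential, which a first-order predictor with
# a_1 close to 0.9 should describe almost perfectly.
frame = [0.9 ** n for n in range(50)]
p = 2
r = autocorrelation(frame, p)
a, err = levinson_durbin(r, p)

# Residual of each normal equation; values near zero confirm the solution.
residuals = [sum(a[k - 1] * r[abs(i - k)] for k in range(1, p + 1)) - r[i]
             for i in range(1, p + 1)]
print(a, err, residuals)
```

For the 10th-order analysis mentioned earlier, the same routine would be called with p = 10 on each speech frame; a windowing step before the autocorrelation is common in practice but is omitted here for brevity.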
