10.4.2.3 Performance
To provide baseline results, the two predominant architectures in the field are considered: first, dynamic modelling of LLDs such as pitch, energy, and MFCCs by HMMs (for emotion only); second, static modelling using supra-segmental information obtained by applying statistical functionals to the same LLDs at the chunk level. The latter is done either by classification for emotion or by regression in the case of interest; a per-chunk functional extraction of this kind is sketched below.
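As an illustration of the static route, here is a minimal sketch of mapping a frame-level LLD contour to one fixed-length supra-segmental vector per chunk. The functional set used here (mean, standard deviation, extrema, range) is a simplified assumption and not the exact feature set of the baselines:

import numpy as np

def functionals(lld: np.ndarray) -> np.ndarray:
    """Map a (n_frames x n_lld) contour matrix to one static
    supra-segmental vector by applying functionals per LLD."""
    return np.concatenate([
        lld.mean(axis=0),                    # arithmetic mean
        lld.std(axis=0),                     # standard deviation
        lld.min(axis=0),                     # minimum
        lld.max(axis=0),                     # maximum
        lld.max(axis=0) - lld.min(axis=0),   # range
    ])

# Hypothetical chunk: 120 frames of 39-dimensional LLDs (e.g., MFCCs + deltas)
chunk = np.random.randn(120, 39)
x = functionals(chunk)   # fixed-length vector: 5 functionals x 39 LLDs = 195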
It was decided to rely entirely on two standard, publicly available tools widely used in the community: the Hidden Markov Model Toolkit (HTK)^10 [172] in the case of dynamic modelling, and the WEKA 3 Data Mining Toolkit^11 [131] in the case of static modelling. This ensures easy reproducibility of the results and reduces the description of parameters to a minimum: unless specified otherwise, defaults are used.
Constantly picking the majority class for the two-class emotion task of the 2009 Emotion Challenge would result in an accuracy (WA) of 70.1 %, which we consider here, while the chance level for UA is simply 50 %. As instances are unequally distributed among the classes, balancing of the training material is considered to avoid classifier over-fitting. This can be achieved by applying the Synthetic Minority Oversampling TEchnique (SMOTE) [173] as data-driven up-sampling; a sketch of both the evaluation measures and the balancing step follows below. Note that up-sampling does not have any influence in the case of generative modelling: one HMM is trained individually per class, and equal priors are assumed.
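To make WA, UA, and the balancing step concrete, the following is a hedged sketch using scikit-learn and imbalanced-learn as stand-ins for WEKA's implementation; the class distribution and feature matrix are placeholders:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score
from imblearn.over_sampling import SMOTE

y_true = np.array([0] * 70 + [1] * 30)   # imbalanced: class 0 is the majority
y_pred = np.zeros_like(y_true)           # constantly pick the majority class

wa = accuracy_score(y_true, y_pred)                  # weighted accuracy: 0.70
ua = recall_score(y_true, y_pred, average='macro')   # unweighted: (1.0 + 0.0) / 2 = 0.50

# SMOTE up-sampling of the training material (synthetic minority instances)
X_train = np.random.randn(100, 195)      # placeholder feature matrix
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_train, y_true)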
Table 10.19 depicts these results for the two-class emotion task (classification by linear left-right HMMs, one model per emotion, varying numbers of states, two Gaussian mixtures per state, 6 + 4 Baum-Welch re-estimation iterations, Viterbi decoding) in terms of UA and WA. With increased temporal modelling, i.e., a higher number of states, a gradual shift towards a preference for NEG is observed in the considered two-class case.
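A minimal sketch of this dynamic-modelling setup, assuming the hmmlearn package as a stand-in for HTK (the baselines themselves use HTK): one left-right GMM-HMM per class, trained by Baum-Welch (EM), with classification by maximum log-likelihood. Topology details, iteration count, and data are illustrative:

import numpy as np
from hmmlearn.hmm import GMMHMM

def left_right_hmm(n_states: int, n_mix: int = 2) -> GMMHMM:
    """Linear left-right topology: each state may loop or move one step right."""
    hmm = GMMHMM(n_components=n_states, n_mix=n_mix,
                 covariance_type='diag', n_iter=10,    # rough stand-in for 6 + 4
                 init_params='mcw', params='tmcw')     # keep the fixed start state
    hmm.startprob_ = np.eye(n_states)[0]               # always start in state 0
    trans = np.zeros((n_states, n_states))
    for i in range(n_states):
        trans[i, i] = 0.5
        trans[i, min(i + 1, n_states - 1)] += 0.5      # self-loop + forward step
    hmm.transmat_ = trans                              # zeros stay zero under EM
    return hmm

# One model per emotion class; score a test chunk with each, pick the best
models = {}
for label in ('NEG', 'IDL'):
    X = np.random.randn(300, 39)                       # placeholder LLD frames
    models[label] = left_right_hmm(5).fit(X, lengths=[150, 150])

test = np.random.randn(80, 39)
pred = max(models, key=lambda c: models[c].score(test))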
Table 10.20 further shows results for this two-class emotion task, employing the whole feature set and using SVM (SMO learning, linear kernel, pairwise multi-class discrimination); a corresponding pipeline is sketched below. For SVM, an additional pre-processing step is performed: the features are standardised, or z-normalised, i.e., each feature is normalised to zero mean and unit variance. Table 10.20 shows the influence of these two pre-processing steps (balancing and standardisation) and their impact on the target evaluation measure UA. Note that the order of operations is crucial, as standardisation leads to different results if the classes are balanced first.
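A corresponding sketch of the static route, using scikit-learn's SVC (which, like WEKA's SMO, performs pairwise one-vs-one multi-class discrimination) in place of WEKA; the ordering shown, balancing before fitting the scaler, is one of the two variants whose difference the text notes, and the data and dimensionality are placeholders:

import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

X_train = np.random.randn(100, 384)      # placeholder functionals (e.g., a 384-d set)
y_train = np.array([0] * 70 + [1] * 30)

# Order matters: balancing first changes the statistics the scaler estimates
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_train, y_train)
scaler = StandardScaler().fit(X_bal)     # z-normalisation: zero mean, unit variance
clf = SVC(kernel='linear').fit(scaler.transform(X_bal), y_bal)

X_test = np.random.randn(10, 384)
y_pred = clf.predict(scaler.transform(X_test))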
Table 10.21 then depicts the results for the interest baseline. The measures for this task are the Pearson Correlation Coefficient (CC) and the mean linear error.
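For these regression measures, a short sketch; the mean linear error is taken here to be the mean absolute error between prediction and label, which is an interpretation on my part, and the values are placeholders:

import numpy as np
from scipy.stats import pearsonr

y_true = np.array([0.10, 0.40, 0.35, 0.80, 0.60])   # placeholder interest labels
y_pred = np.array([0.20, 0.35, 0.40, 0.70, 0.65])   # placeholder regressor output

cc, _ = pearsonr(y_true, y_pred)           # Pearson Correlation Coefficient
mle = np.mean(np.abs(y_true - y_pred))     # mean (absolute) linear error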
Table 10.19 Baseline results for 2-class emotion by dynamic modelling with HMM

          2-class
#States   UA [%]   WA [%]
   1       62.3     71.7
   3       62.9     57.5
   5       66.1     65.3
^10 http://htk.eng.cam.ac.uk/docs/docs.shtml
^11 http://www.cs.waikato.ac.nz/ml/weka/