Robust Emotion Recognition using Sentence, Word and Syllable Level Prosodic Features - Robust Emotion Recognition Using Spectral and Prosodic Features


25 dimensional feature vectors is slightly better than the feature vectors with other

dimensions. Here, the dimension 25 for pitch and energy contours is not crucial. The

reduced size of the pitch and energy contours has to be chosen so that the dynamics

of the original contours are retained in their resampled versions. The basic reasons

for reducing the dimensionality of the original pitch and energy contours are (1) the

need for fixed dimensional input feature vectors for developing the SVM models

and (2) the need to keep the number of training feature vectors in proportion

to the size of the feature vector, to avoid the curse of dimensionality

(the number of feature vectors required grows exponentially as the dimensionality

of the feature vector increases, so the number of available feature vectors

must always be balanced against their dimensionality). The local duration

pattern is represented by the sequence of normalized syllable durations. Here the

syllable durations are determined using the time interval between successive VOPs

[2]. The length of the duration contour is proportional to the number of syllables present

in the sentence, which leads to feature vectors of unequal lengths. To obtain

feature vectors of equal length, the length of the duration vector is fixed at 18 (the

number of syllables in the longest utterance of IITKGP-SESC).

The duration vectors of shorter utterances are zero padded to this length.
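
The fixed-length feature construction described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names resample_contour and duration_vector are hypothetical, linear interpolation is assumed as the resampling method, and the durations are assumed to be normalized by their sum, since the text specifies neither. The sizes 25 and 18 are the values quoted in the text.

```python
import numpy as np

CONTOUR_DIM = 25    # reduced contour size quoted in the text
MAX_SYLLABLES = 18  # fixed duration-vector length quoted in the text


def resample_contour(contour, dim=CONTOUR_DIM):
    """Resample a 1-D pitch or energy contour to `dim` points.

    Linear interpolation is an assumption; the text only requires that
    the dynamics of the original contour be retained.
    """
    contour = np.asarray(contour, dtype=float)
    if contour.size == 0:
        raise ValueError("contour must contain at least one sample")
    # Map both the original and the target sample positions onto [0, 1].
    src = np.linspace(0.0, 1.0, num=contour.size)
    dst = np.linspace(0.0, 1.0, num=dim)
    return np.interp(dst, src, contour)


def duration_vector(vop_times, length=MAX_SYLLABLES):
    """Build a zero-padded vector of normalized syllable durations.

    Durations are the intervals between successive vowel onset points
    (VOPs). Handling of the final syllable, which has no following VOP,
    is not specified in the text and is left out here.
    """
    vop_times = np.asarray(vop_times, dtype=float)
    durations = np.diff(vop_times)  # interval between successive VOPs
    if durations.size > length:
        raise ValueError("more syllable durations than the fixed length")
    total = durations.sum()
    # Assumed normalization: divide by the total duration.
    normalized = durations / total if total > 0 else durations
    padded = np.zeros(length)
    padded[:durations.size] = normalized  # remaining entries stay zero
    return padded


if __name__ == "__main__":
    pitch = 120.0 + 20.0 * np.sin(np.linspace(0.0, 3.0, 140))  # toy contour
    print(resample_contour(pitch).shape)
    print(duration_vector([0.0, 0.2, 0.5, 1.0]))
```

With these two functions, every utterance yields a pitch or energy vector of 25 values and a duration vector of 18 values regardless of its length, which is what a classifier with fixed dimensional input, such as an SVM, requires.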

3.4.2 Word and Syllable Level Features

The global and local prosodic features extracted from words and syllables help to

analyze the contribution of different segments (sentences, words, and syllables) and

their positions (initial, middle, and final) in the utterance toward emotion recognition.

Word and syllable boundaries are determined automatically, using vowel onset points

[3, 4]. Before extracting the features, the words in all the utterances of the database

are divided into three groups, namely initial, middle, and final words. Similarly, the

syllables within each group of words are also classified as initial, middle, and final

syllables. While categorizing the words, the length of the words and the number of words

in an utterance are taken into consideration. The length of a word is measured in terms of

its number of syllables. If there are more than 3 words in the utterance and the first word

is monosyllabic, then the first 2 words are grouped as initial words. This is because

a monosyllabic word is often too short for the speaker to clearly express a specific

emotion, and so may not capture sufficient emotion specific information.

The scheme for grouping words and syllables into the three groups mentioned

above is given in Table 3.5. This table contains word and syllable

grouping details for the 15 sentences of IITKGP-SESC. For instance, grouping of

words in the case of (S1, S8, S9) and (S5, S11) is straightforward, as there are either

3 or 6 words in the sentences. In the case of S2, two of the 5 words are grouped as

the initial words, as the first word of the sentence is monosyllabic. The last

word, which contains 4 syllables, is treated as the final word, and the remaining two

words are considered the middle words. Similarly, in the case of S3, the first word alone

is considered the initial word, as it contains 4 syllables. On the basis of production
