Table 2.2 Details of the datasets derived from IITKGP-SESC, for various studies on speech emotion recognition

Set1 (session-independent emotion recognition)
Training data: The utterances of all 15 text prompts, recorded from 10 speakers, are used in training. Out of the 10 sessions, 8 sessions (1-8) of each speaker are used in training.
Testing data: The utterances of all 15 text prompts, recorded from 10 speakers, are used in testing. Out of the 10 sessions, 2 sessions (9 and 10) of each speaker are used in testing.

Set2 (session- and text-independent emotion recognition)
Training data: Out of 15 text prompts, the utterances of 10 prompts (1-10), recorded from 10 speakers, are used in training. Out of the 10 sessions, 8 sessions (1-8) of each speaker are used in training.
Testing data: Out of 15 text prompts, the utterances of 5 prompts (11-15), recorded from 10 speakers, are used in testing. Out of the 10 sessions, 2 sessions (9 and 10) of each speaker are used in testing.

Set3 (session-, text-, and speaker-independent emotion recognition)
Training data: Out of 15 text prompts, the utterances of 10 prompts (1-10), recorded from 8 speakers (4 males and 4 females), are used in training. All 10 sessions of each speaker are used in training.
Testing data: Out of 15 text prompts, the utterances of 5 prompts (11-15), recorded from 2 speakers (a male and a female), are used in testing. All 10 sessions of each speaker are used in testing.
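To make the three partitioning schemes above concrete, the sketch below builds each train/test split from utterance records indexed by speaker, prompt, and session. The record fields, the numbering, and the choice of held-out speakers in Set3 are illustrative assumptions, not details fixed by IITKGP-SESC.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    speaker: int   # 1..10 (assumed numbering)
    prompt: int    # 1..15
    session: int   # 1..10
    path: str      # path to the audio file (hypothetical)

def split_set1(utts):
    """Set1: session independent -- sessions 1-8 train, 9-10 test."""
    train = [u for u in utts if u.session <= 8]
    test  = [u for u in utts if u.session >= 9]
    return train, test

def split_set2(utts):
    """Set2: session and text independent -- prompts 1-10 with
    sessions 1-8 train; prompts 11-15 with sessions 9-10 test."""
    train = [u for u in utts if u.prompt <= 10 and u.session <= 8]
    test  = [u for u in utts if u.prompt >= 11 and u.session >= 9]
    return train, test

def split_set3(utts, test_speakers=(5, 10)):
    """Set3: session, text, and speaker independent -- prompts 1-10
    from the 8 held-in speakers train; prompts 11-15 from the 2
    held-out speakers test. Which speakers are held out is an
    assumption here."""
    train = [u for u in utts if u.prompt <= 10 and u.speaker not in test_speakers]
    test  = [u for u in utts if u.prompt >= 11 and u.speaker in test_speakers]
    return train, test
```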
The emotions considered in Emo-DB are anger, boredom, disgust, fear, happiness, neutral, and sadness. Ten linguistically neutral German sentences are chosen for database construction. The database is recorded using a Sennheiser MKH 40 P48 microphone, at a sampling frequency of 16 kHz. Samples are stored as 16-bit numbers. Eight hundred and forty (840) utterances of Emo-DB are used in this work.
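The stated recording format can be checked directly when loading the data; a minimal sketch, assuming a placeholder file name, is:

```python
from scipy.io import wavfile

# Placeholder file name; per the description above, Emo-DB audio is
# 16 kHz, 16-bit PCM.
sr, samples = wavfile.read("emodb_utterance.wav")
assert sr == 16000                     # 16 kHz sampling frequency
assert samples.dtype.name == "int16"   # 16-bit integer samples
print(f"{len(samples)/sr:.2f} s of audio at {sr} Hz")
```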
In the case of the Berlin database, the speech data of 8 speakers is used for training the models, and the speech data of the remaining 2 speakers is used for validating the trained models.
2.3 Feature Extraction
In this work, LPCCs, MFCCs, and formant features are used to represent the spectral information. Sub-syllabic spectral features are derived from the speech segments of consonants, vowels, and consonant-to-vowel transitions. Pitch-synchronous spectral features are derived from each pitch cycle of the speech signal. The extraction of these spectral features is discussed in the following subsections; an illustrative sketch is given below.
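As an illustrative sketch of such spectral feature extraction (not the exact configuration used in this work), the snippet below computes frame-level MFCCs and LPC coefficients with librosa, and converts the LPCs to LPCCs via the standard LPC-to-cepstrum recursion. The file name, frame size, hop length, and coefficient orders are assumed values.

```python
import librosa
import numpy as np

# Load speech at 16 kHz, matching the databases described above.
y, sr = librosa.load("utterance.wav", sr=16000)  # placeholder file name

# 13 MFCCs per 25 ms frame with a 10 ms hop (assumed analysis settings).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)

# 10th-order LPC over a single 25 ms frame.
frame = y[:400]
a = librosa.lpc(frame, order=10)  # a = [1, a1, ..., a10]

def lpc_to_lpcc(a, n_ceps):
    """Convert LPC coefficients [1, a1, ..., ap] to LPCCs using the
    standard recursion (a sketch; the gain term is omitted)."""
    p = len(a) - 1
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = -a[n] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc -= (k / n) * c[k - 1] * a[n - k]
        c[n - 1] = acc
    return c

lpcc = lpc_to_lpcc(a, n_ceps=12)
print(mfcc.shape, a.shape, lpcc.shape)
```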