The simulated emotional speech database in Telugu (the second most widely spoken language in India) is collected using radio artists in eight common emotions. This database is named the Indian Institute of Technology Kharagpur Simulated Emotion Speech Corpus (IITKGP-SESC). The emotions present in the database are anger, disgust, fear, happiness, neutral, sadness, sarcasm, and surprise. The recognition of speech emotions using different speech features is studied using IITKGP-SESC. In this topic, three information sources, namely (1) the excitation source, (2) the vocal tract system, and (3) prosody, are explored for robust recognition of speech emotions.
Most emotion recognition studies based on spectral features have employed conventional block processing, where speech features are derived from the entire speech signal. However, the major portion of a speech signal consists of steady vowel regions, in which the spectral properties of the signal vary little. Therefore, while extracting spectral features, there is scope to restrict the extraction to those speech regions that yield spectrally non-redundant information. With this motivation, in this work, spectral features are extracted separately from the vowel, consonant, and consonant-vowel (CV) transition regions [1].
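As an illustration of such segment-wise extraction, the following sketch computes LPCC vectors separately for each labelled region. It is a minimal Python/NumPy example, not the exact implementation of [1]: the segment boundaries are assumed to be given in samples (e.g., derived from vowel onset points, discussed below), and the frame size, hop size, and LP order are illustrative choices.

    import numpy as np

    def lpc_coeffs(frame, order=10):
        # LP coefficients by the autocorrelation method (Levinson-Durbin).
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:][:order + 1]
        a = np.zeros(order + 1)
        a[0], err = 1.0, r[0] + 1e-12
        for i in range(1, order + 1):
            k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
            a_prev = a.copy()
            for j in range(1, i):
                a[j] = a_prev[j] + k * a_prev[i - j]
            a[i] = k
            err *= 1.0 - k * k
        return a                        # A(z) = 1 + a[1]z^-1 + ... + a[p]z^-p

    def lpcc(a, n_ceps=10):
        # Cepstral coefficients of 1/A(z) via the standard LPC-to-cepstrum recursion.
        c = np.zeros(n_ceps + 1)
        for n in range(1, n_ceps + 1):
            acc = a[n] if n < len(a) else 0.0
            for k in range(1, n):
                acc += (k / n) * c[k] * (a[n - k] if n - k < len(a) else 0.0)
            c[n] = -acc
        return c[1:]

    def segment_lpccs(speech, regions, fs=8000, order=10):
        # Mean LPCC vector per labelled region, e.g.
        # regions = {"consonant": (s0, s1), "transition": (s1, s2), "vowel": (s2, s3)}
        frame, hop = int(0.020 * fs), int(0.010 * fs)
        feats = {}
        for label, (start, end) in regions.items():
            seg = speech[start:end]
            vecs = [lpcc(lpc_coeffs(seg[i:i + frame] * np.hamming(frame), order), order)
                    for i in range(0, len(seg) - frame, hop)]
            feats[label] = np.mean(vecs, axis=0)
        return feats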
Similarly, to capture finer spectral variations, pitch synchronous spectral features are used for speech emotion recognition. High-amplitude regions of the spectrum degrade least under noise, so features that emphasize them remain robust when developing speech systems with noisy or real-time speech data.
In this topic, we have used formant features along with other spectral features for recognizing the emotions.
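Formant features of this kind are commonly estimated from the angles of the complex roots of the LP polynomial. A minimal sketch, continuing from the code above and reusing lpc_coeffs; the frequency and bandwidth thresholds are illustrative, not values from this work:

    def formant_freqs(frame, fs=8000, order=10):
        # Estimate F1-F3 from the pole angles of the LP polynomial.
        a = lpc_coeffs(frame * np.hamming(len(frame)), order)
        roots = np.roots(a)
        roots = roots[np.imag(roots) > 0]              # one root per conjugate pair
        freqs = np.angle(roots) * fs / (2.0 * np.pi)   # pole angle -> frequency (Hz)
        bws = -(fs / np.pi) * np.log(np.abs(roots))    # pole radius -> bandwidth (Hz)
        keep = (freqs > 90.0) & (bws < 400.0)          # discard spurious wide poles
        return np.sort(freqs[keep])[:3]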
In this work, GMMs (Gaussian mixture models) are used for developing the emotion models from spectral features. Among the different spectral features, LPCCs (linear prediction cepstral coefficients) seem to perform best at discriminating the emotions. Recognition performance using formant features alone is not appreciable, but formant features in combination with other features have shown an improvement in performance.
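The modelling step can be sketched with scikit-learn's GaussianMixture: one model per emotion is fitted on the pooled feature vectors of that emotion, and a test utterance is assigned to the model giving the highest average log-likelihood. The number of mixture components and the diagonal covariance are illustrative choices, not necessarily those of this work:

    from sklearn.mixture import GaussianMixture

    def train_emotion_gmms(features_by_emotion, n_components=16):
        # features_by_emotion: {"anger": X_anger, ...}, each X of shape (n_frames, dim)
        models = {}
        for emotion, X in features_by_emotion.items():
            gmm = GaussianMixture(n_components=n_components,
                                  covariance_type="diag", random_state=0)
            models[emotion] = gmm.fit(X)
        return models

    def classify_utterance(models, X):
        # score() returns the average per-frame log-likelihood under each model.
        return max(models, key=lambda e: models[e].score(X))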
In this work, the boundaries of the sub-syllabic segments (consonant, vowel, and CV transition) are identified using vowel onset points (VOPs). Spectral features of the CV transition regions have shown recognition performance close to that achieved by spectral features from the entire utterance. This indicates that the crucial emotion-specific information is present in the CV transition regions of the syllables.
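One widely used family of VOP detection methods in the literature locates peaks in the smoothed Hilbert envelope of the LP residual, since the excitation energy rises sharply at vowel onsets. The sketch below, continuing from the code above, shows only this single evidence and is not the exact detector of [1]; the smoothing window and minimum peak spacing are illustrative:

    from scipy.signal import hilbert, lfilter, find_peaks

    def vop_candidates(speech, fs=8000, order=10):
        # Frame-wise inverse filtering gives the LP residual (excitation signal).
        frame = int(0.020 * fs)
        res = np.zeros(len(speech))
        for i in range(0, len(speech) - frame + 1, frame):
            seg = speech[i:i + frame]
            a = lpc_coeffs(seg * np.hamming(frame), order)
            res[i:i + frame] = lfilter(a, [1.0], seg)
        env = np.abs(hilbert(res))                       # Hilbert envelope
        win = int(0.050 * fs)
        smooth = np.convolve(env, np.hamming(win) / win, mode="same")
        peaks, _ = find_peaks(smooth, distance=int(0.1 * fs))
        return peaks                                     # candidate VOPs (in samples)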
Pitch synchronous spectral features have shown the best recognition performance among the various spectral features proposed in this work. This may be due to the finer variations in spectral characteristics that pitch synchronous analysis captures. For the accurate detection of pitch cycles, the zero-frequency filter (ZFF) based method is used in this work.
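The zero-frequency filtering method passes the differenced speech through a cascade of ideal resonators at 0 Hz (implemented as repeated integration), removes the slowly varying trend by iterated local-mean subtraction over roughly a pitch period, and marks epochs at the positive-going zero crossings of the result. A compact sketch, assuming an average pitch period of 10 ms:

    def zff_epochs(speech, fs=8000, avg_pitch_ms=10):
        x = np.diff(speech, prepend=speech[0])   # difference to remove any DC offset
        y = x.astype(np.float64)
        for _ in range(4):                       # two 0-Hz resonators = 4 integrations
            y = np.cumsum(y)
        win = int(avg_pitch_ms * 1e-3 * fs) | 1  # odd-length trend-removal window
        for _ in range(3):                       # iterated local-mean subtraction
            y = y - np.convolve(y, np.ones(win) / win, mode="same")
        # Epochs (glottal closure instants) at positive-going zero crossings.
        # For long recordings, apply blockwise: the integrator output grows steadily.
        return np.where((y[:-1] < 0) & (y[1:] >= 0))[0]

Pitch synchronous spectral features can then be computed from analysis frames anchored at successive epochs rather than at a fixed hop.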
Prosodic features are treated as effective correlates of speech emotions. In the literature, static prosodic features have been thoroughly investigated for emotion recognition. However, perceptual observation of emotional speech indicates that emotions are manifested gradually through the sequence of phonemes.
This gradual manifestation of emotions may be captured through variations in the articulator movements while producing the emotions. With this motivation, in this topic, temporal variations of the prosody contours are proposed to capture the emotion-specific information. Global and local prosodic features extracted from sentences, words, and syllables are explored to discriminate the emotions [2].
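To make the global/local distinction concrete, the sketch below assumes per-frame F0 and energy contours are already available from a standard pitch tracker. The global features are static contour statistics; the local features resample the voiced F0 contour to a fixed length to retain its temporal shape. The feature set and dimensions are illustrative, not the exact set of [2]:

    def global_prosody(f0, energy, duration_s):
        # Static statistics of the F0 and energy contours plus segment duration.
        v = f0[f0 > 0]                                   # voiced frames only
        return np.array([v.mean(), v.std(), v.max() - v.min(),
                         energy.mean(), energy.std(), duration_s])

    def local_prosody(f0, n_points=25):
        # Dynamic features: the voiced F0 contour resampled to a fixed length.
        v = f0[f0 > 0]
        idx = np.linspace(0.0, len(v) - 1.0, n_points)
        return np.interp(idx, np.arange(len(v)), v)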
The contribution of the above speech segments in different positions (initial, middle, and final) of the sentences and words is also analyzed.