2.3.3 Formant Features
Cepstral coefficients are the standard front-end features for developing various
speech systems; however, they perform poorly with noisy or real-life speech.
Therefore, supplementary features alongside the basic cepstral coefficients are
essential to handle real-life speech. The higher amplitude regions of a spectrum,
such as formants, are relatively less affected by noise. K. K. Paliwal et al.
extracted spectral sub-band centroids from high amplitude regions of the spectrum
and used them for noisy speech recognition [7]. With this viewpoint, formant
parameters are used in this study as supplementary features to cepstral features.
Also note that the conventional cepstral features utilize only amplitude (energy)
information from the speech power spectrum, whereas the proposed formant features
utilize frequency information as well.
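To make the idea of a spectral sub-band centroid concrete, the following is a minimal sketch that computes the power-weighted mean frequency of each sub-band. It is an illustration only: the function name subband_centroids, the band edges and the toy spectrum are assumptions made here, and the formulation used in [7] may differ (for example in band layout or in how the power values are weighted).

```python
def subband_centroids(freqs_hz, power, band_edges_hz):
    """Return the power-weighted mean frequency of each sub-band.

    freqs_hz      : frequency of each spectral bin
    power         : power-spectrum value at each bin (same length)
    band_edges_hz : ascending edges; band k covers [edge[k], edge[k+1])
    """
    centroids = []
    for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        weighted_sum = 0.0
        total_power = 0.0
        for f, p in zip(freqs_hz, power):
            if lo <= f < hi:
                weighted_sum += f * p
                total_power += p
        if total_power > 0:
            centroids.append(weighted_sum / total_power)
        else:
            # Assumption: an empty band reports its midpoint.
            centroids.append(0.5 * (lo + hi))
    return centroids


# Toy spectrum (hypothetical values): one peak centred on 500 Hz.
freqs = [0, 250, 500, 750, 1000, 1250, 1500, 1750]
power = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0]
centroids = subband_centroids(freqs, power, [0, 1000, 2000])
```

Unlike an amplitude-only feature, the centroid moves when the dominant peak inside a band shifts in frequency, which is the kind of frequency information the formant features are meant to capture.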
In general, formant tracks represent sequences of vocal tract shapes; hence,
formant analysis based on their strength, location and bandwidth may help to extract
vocal-tract-related, emotion-specific information from a speech signal. Figure 2.2
shows the spectra for the 8 emotions of IITKGP-SESC. The spectra are derived
from the syllable tha of the Telugu sentence thallidhandrulanu gauravincha valenu.
In this case, the language, text, speaker and contextual information are kept the
same. It can therefore be inferred from the figure that the variation in the spectra
is due to the emotions. Formant frequencies are crucial to speech perception. Hence,
a slight change in these parameters causes a perceptual difference, which may lead
to the manifestation of different emotions. It is evident from Fig. 2.2 that the position
and strength of formants are clearly distinct for different emotions. Spectral peaks
indicate the intensity of specific frequency components (or frequency bands). Their
distinctive nature across emotions indicates the presence of emotion-specific
information. The rate of decrease in spectral amplitude as a function of frequency
is known as spectral roll-off or spectral tilt. This decrease occurs mainly because
the strength of the harmonics falls as the frequency increases. A speaker can induce
more strength into higher harmonics by consciously controlling the glottal vibration.
Abrupt closing of the glottis increases the energy in the higher frequency components.
This leads to the variation in spectral roll-off for different emotions. Figure 2.2 shows
distinct spectral roll-offs for each of the emotions.
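One simple way to put a number on the spectral tilt described above is to fit a straight line to the spectrum level in dB against frequency and read off the slope. The sketch below does this with an ordinary least-squares fit. The function name spectral_tilt_db_per_khz, the floor value and the toy spectrum are assumptions for illustration; the text does not state how the roll-off in Fig. 2.2 was measured.

```python
import math


def spectral_tilt_db_per_khz(freqs_hz, power, floor=1e-12):
    """Least-squares slope of the power spectrum level (dB) versus frequency (kHz).

    A more negative value means the spectrum falls off faster with frequency.
    """
    xs = [f / 1000.0 for f in freqs_hz]
    # The floor avoids log10(0) for empty bins (assumed value).
    ys = [10.0 * math.log10(max(p, floor)) for p in power]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy spectrum (hypothetical): level drops by 6 dB for every kHz.
freqs = [0, 1000, 2000, 3000, 4000]
power = [10 ** (-0.6 * k) for k in range(5)]
tilt = spectral_tilt_db_per_khz(freqs, power)
```

Under this measure, a voice with stronger high harmonics, as produced by abrupt glottal closure, would give a slope closer to zero than a voice whose energy is concentrated at low frequencies.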
Though it is assumed that the bandwidth of a formant does not influence phonetic
information [6], it represents some speaker-specific information. Figure 2.2 depicts
the variation in the formant bandwidths for different emotions. Because the speaker,
text, language and context remain the same, even a slight variation in the bandwidth
may be attributed to speaker-induced, emotion-specific information. Formant
bandwidth is the width of the frequency band measured about 3 dB below the
respective formant peak.
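The 3 dB definition can be sketched as follows: starting from a formant peak in a spectrum expressed in dB, move outwards on each side until the level has dropped 3 dB, then take the distance between the two crossing frequencies. The function name bandwidth_3db, the linear interpolation between bins and the toy spectrum are assumptions for illustration, not the procedure used to produce Fig. 2.2.

```python
def bandwidth_3db(freqs_hz, level_db, peak_index, drop_db=3.0):
    """Width of the band around a spectral peak at drop_db below the peak.

    Returns None when the level never falls far enough on one side.
    """
    target = level_db[peak_index] - drop_db

    def crossing(step):
        # Walk away from the peak (step = -1 or +1) to the first bin
        # at or below the target level.
        i = peak_index
        while 0 <= i + step < len(level_db):
            j = i + step
            if level_db[j] <= target:
                # Interpolate linearly between the last bin above the
                # target (i) and the first bin at or below it (j).
                frac = (level_db[i] - target) / (level_db[i] - level_db[j])
                return freqs_hz[i] + frac * (freqs_hz[j] - freqs_hz[i])
            i = j
        return None

    lower = crossing(-1)
    upper = crossing(+1)
    if lower is None or upper is None:
        return None
    return upper - lower


# Toy spectrum (hypothetical): a symmetric peak at 500 Hz.
freqs = [400, 450, 500, 550, 600]
level = [-12.0, -6.0, 0.0, -6.0, -12.0]
bw = bandwidth_3db(freqs, level, peak_index=2)
```

In this toy case the level crosses the 3 dB point at 475 Hz and 525 Hz, so the measured bandwidth is 50 Hz. On real speech spectra the result depends on the frequency resolution and on how the spectrum was smoothed.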
 