Digital Signal Processing Reference
suitable correlates to represent the above-mentioned speech quality features. In the
literature, pitch, duration, energy and their derivatives are widely used to represent
prosodic features. These prosodic features are also known as supra-segmental or
long-term features, and these terms are used interchangeably in this topic to represent
prosodic information. Human beings perceive the emotions present in speech by exploiting
the prosodic features, and in this study these features are explored for classifying
emotions. The above discussion shows the importance of exploring the excitation
source, vocal tract system and prosodic features to capture emotion-specific
information. This topic addresses the problem of speech emotion recognition
by exploring the above-mentioned emotion-specific features for discriminating
emotions. Terms such as recognition performance, classification performance, and
discrimination are used in this topic in the context of emotions, unless specifically
mentioned otherwise.
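As a rough illustration of two of the prosodic correlates named above (energy and its derivative, plus utterance duration), the following minimal sketch computes frame-level short-time energy and its first-order difference on a synthetic signal. The function names `frame_energies` and `deltas`, the 50 ms frame length, the 8 kHz sampling rate and the test tone are all assumptions made for this example and are not taken from the text. Pitch is omitted because it would require a pitch tracker.

```python
import math

def frame_energies(signal, frame_len, hop):
    """Short-time energy: sum of squared samples in each frame (hypothetical helper)."""
    energies = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energies.append(sum(x * x for x in frame))
    return energies

def deltas(values):
    """First-order difference between consecutive frames, a simple derivative estimate."""
    return [b - a for a, b in zip(values, values[1:])]

# Assumed test input: a 100 Hz tone sampled at 8 kHz whose amplitude
# ramps up linearly over 0.5 s (stands in for a speech utterance).
sample_rate = 8000
n = sample_rate // 2
signal = [(i / n) * math.sin(2 * math.pi * 100 * i / sample_rate) for i in range(n)]

# 400 samples = 50 ms frames, non-overlapping (assumed values).
energy = frame_energies(signal, frame_len=400, hop=400)
energy_delta = deltas(energy)
duration_s = len(signal) / sample_rate  # utterance-level duration in seconds
```

In practice such frame-level contours are usually reduced to utterance-level statistics (for example mean, range, or slope) before classification. That step is left out here because the text does not specify it.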
1.4 Emotional Speech Databases
An important issue to be considered in evaluating emotional speech systems is the
quality of the databases used to develop and assess the performance of the systems
[5]. The objectives and methods of collecting speech corpora vary widely according
to the motivation behind the development of speech systems. Speech corpora used
for developing emotional speech systems can be divided into three types. The important
properties of these databases are briefly summarized in Table 1.1.
1. Actor-based (simulated) emotional speech database.
2. Elicited (Induced) emotional speech database.
3. Natural emotional speech database.
Simulated emotional speech corpora are collected from reasonably experienced
and trained theatre or radio artists. Artists are asked to express linguistically neutral
sentences in different emotions. Recording is done in different sessions to account for
variations in the degree of expressiveness and in the physical speech production
mechanism of human beings. This is one of the easier and more reliable methods of collecting
expressive speech databases containing a wide range of emotions. More than 60%
of the databases collected for expressive speech research are of this kind. The
emotions collected through simulated means are fully developed in nature: they are
typically intense and incorporate most of the aspects considered relevant for the
emotion [21]. These are also known as full-blown emotions. Generally, it is found
that acted/simulated emotions tend to be more strongly expressed than real ones [5, 22].
Elicited speech corpora are collected by simulating an artificial emotional
situation without the knowledge of the speaker. Speakers are made to involve themselves
in an emotional conversation with the anchor, where different contextual situations are
created by the anchor through conversation to elicit different emotions from the
subject, without his/her knowledge. These databases may be more natural than their
simulated counterparts, but subjects may not be properly expressive if they know
 