Speech is an important way of human communication. However, despite the tremendous research on speech recognition done since the late 1950s, emotion remains one of the major differences between humans and machines [1]. Recognizing human emotion from speech enables promising applications such as healthcare services, commercial conversations, virtual humans, emotion-based indexing, and information retrieval.
An utterance (a phrase, a short sentence, etc.) is often considered a fundamental unit and is recognized on the basis of global utterance-wise statistics of derived segments, so the segment features are transformed into a single feature vector for each emotional utterance [2-6]. However, in recent research, an increasing number of scientists and psychologists have argued that changes in emotional activity occur within a very short period of time. Several studies have emphasized the importance of the temporal dynamics of emotions [7, 8]. Furthermore, one study illustrates that emotions are inherently dynamic [9]; it contains an illustration showing that, within 2.6 s, a person went through several emotional activities, such as surprise, fear, an aggressive stance, and relaxation. In addition, another study demonstrates that the emotion effect occurs within hundreds of milliseconds [10].
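The utterance-wise statistics mentioned above can be made concrete with a small sketch. This is a hypothetical illustration, not the exact feature set used in the cited work: it assumes the statistics are simple per-dimension functionals (mean, standard deviation, minimum, maximum) computed over the segment features of one utterance.

```python
import numpy as np

def utterance_statistics(segment_features):
    """Collapse per-segment features (n_segments x n_features) into a
    single utterance-level vector of global statistics.

    Hypothetical choice of functionals: mean, std, min, and max of
    each feature dimension, concatenated into one vector."""
    segs = np.asarray(segment_features, dtype=float)
    return np.concatenate([segs.mean(axis=0),
                           segs.std(axis=0),
                           segs.min(axis=0),
                           segs.max(axis=0)])

# e.g. 3 segments with 2 features each -> one 8-dimensional utterance vector
vec = utterance_statistics([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

Whatever the concrete functionals are, the key property is the same: temporal ordering of the segments is discarded, which is precisely the information the segment-level approach tries to keep.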
Motivated by these findings, we focused on a novel scheme to improve speech emotion recognition by using segment-level features instead of utterance-wise features [11-13]. Many researchers have recently questioned whether the utterance-level approach is the right choice for modeling emotions [14], because utterance-wise statistics have difficulty avoiding the influence of spoken content. Moreover, a segment-level feature extraction approach can exploit valuable information that is neglected when only utterance-wise statistics are calculated. This hypothesis is also supported by several studies [15, 16] showing that improvements can be made by adding segment-level features to the common utterance-level features.
In this study, we adopted a purely segment-level strategy for recognizing speech emotion and abandoned utterance-wise features, both to reduce noise such as spoken content and to exploit information that is neglected when calculating utterance-wise statistics. One issue with segment-level speech emotion recognition is that it greatly increases the difficulty of training, because a single utterance is divided into a number of segments. The aim of this paper is to design an approach for recognizing utterance-level emotion that is based on aggregating segment-level labels, and to extract additional information such as emotion strength. The concept is illustrated in Fig. 1.
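The aggregation step can be sketched as follows. The concrete aggregation rule is not specified in this excerpt, so this is a minimal sketch assuming a simple majority vote over segment labels, with the vote share used as a crude proxy for emotion strength; both choices are assumptions for illustration only.

```python
from collections import Counter

def utterance_emotion(segment_labels):
    """Aggregate per-segment emotion labels into one utterance-level
    decision (hypothetical rule: majority vote).

    Returns the winning label and its vote share, the latter serving
    here as an illustrative stand-in for emotion strength."""
    counts = Counter(segment_labels)
    label, votes = counts.most_common(1)[0]
    strength = votes / len(segment_labels)
    return label, strength

# e.g. four segments, three predicted as "angry"
label, strength = utterance_emotion(["angry", "angry", "neutral", "angry"])
# label == "angry", strength == 0.75
```

A voting scheme like this keeps the per-segment decisions available, so the proportion of segments assigned to the winning class naturally yields a graded quantity rather than a single hard label.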
2 Experimental Design for Emotion Database
A well-annotated database is needed to construct a robust method for recognizing emotions from speech signals [17]. Our experiment emphasizes natural speech. The participants were prevented from becoming aware that they were in an