end-of-utterance detection, since listeners' signals are often emitted at
phrase boundaries. So far, the end of an utterance is signaled when
the user presses the "enter" key; in the future, incremental parsing of the
user's typed speech will be used to predict this relevant moment. Other
appropriate moments at which a backchannel could be provided are
found by considering the type of word pronounced by the user: a
backchannel is more appropriate after a relevant word, such as a noun or
a verb, than after an article. A rule-based approach helps the system
determine which type of backchannel should be provided after
a specific event; for example, after a pause the system can select a
slight nod or an acoustic response like "mhm". After every new word,
a probabilistic component computes the emission probability of
each single backchannel. Such a probability depends on three other
probabilities: the probability of having successfully understood the
new word, the probability of providing a certain backchannel B, and the
conditional probability that Max performs the backchannel B given a
level x of understanding. Thanks to this system, Max is able to sustain
long and coherent conversations with users.
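A minimal sketch of how these three quantities might be combined is given below. The function names, the discrete set of understanding levels, and the naive multiplicative combination are illustrative assumptions for this example, not the published Max implementation.

```python
# Illustrative sketch only: the exact combination rule used in Max may differ.
# The three quantities named in the text are combined into a single score,
# assuming independence and a small discrete set of understanding levels.

UNDERSTANDING_LEVELS = ("low", "medium", "high")

def backchannel_score(p_word_understood, p_backchannel_prior,
                      p_backchannel_given_level, p_level):
    """Return a score proportional to the probability of emitting backchannel B.

    p_word_understood         -- P(new word successfully understood)
    p_backchannel_prior       -- prior P(B) of producing this backchannel at all
    p_backchannel_given_level -- mapping: level x -> P(B | understanding = x)
    p_level                   -- mapping: level x -> P(understanding = x)
    """
    marginal_over_levels = sum(p_level[x] * p_backchannel_given_level[x]
                               for x in UNDERSTANDING_LEVELS)
    return p_word_understood * p_backchannel_prior * marginal_over_levels

# Example: score a slight head nod after a newly typed word (made-up numbers).
score_nod = backchannel_score(
    p_word_understood=0.9,
    p_backchannel_prior=0.3,
    p_backchannel_given_level={"low": 0.1, "medium": 0.4, "high": 0.7},
    p_level={"low": 0.2, "medium": 0.5, "high": 0.3},
)
print(f"nod score: {score_nod:.3f}")
```

In practice the candidate backchannel with the highest score (or one sampled in proportion to the scores) would be performed, provided the rule-based component allows it at that moment.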
4.2 Acoustic cues-based models
Other researchers have based their backchannel systems on the
recognition of the user's signals. In particular, the systems belonging
to this class look for audio cues. They are based on the observation that
backchannel signals are often provided at moments when the speaker can
perceive them more easily. Such a moment in an interaction corresponds
to a grammatical completion, usually accompanied by a region of low
pitch (Ward, 1996).
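The pitch cue can be turned into a simple trigger. The sketch below, which does not reproduce Ward's exact rule, flags a backchannel opportunity when the pitch stays below a low percentile for a sustained stretch; the frame rate, percentile threshold, and minimum region length are illustrative choices.

```python
# Illustrative sketch, not Ward's published rule: detect sustained regions of
# low pitch in a stream of f0 estimates and flag them as backchannel
# opportunities. All thresholds below are assumptions for this example.

import numpy as np

FRAME_RATE_HZ = 100          # one f0 estimate every 10 ms (assumed)
LOW_PITCH_PERCENTILE = 26    # "low" pitch = below this percentile of voiced f0
MIN_REGION_MS = 110          # a low-pitch region must last at least this long

def low_pitch_opportunities(f0_track):
    """Return frame indices at which a sustained low-pitch region is detected.

    f0_track -- sequence of f0 values in Hz; 0 marks unvoiced frames.
    """
    f0 = np.asarray(f0_track, dtype=float)
    voiced = f0 > 0
    if not voiced.any():
        return []
    threshold = np.percentile(f0[voiced], LOW_PITCH_PERCENTILE)
    min_frames = int(MIN_REGION_MS * FRAME_RATE_HZ / 1000)

    opportunities, run = [], 0
    for i, (is_voiced, value) in enumerate(zip(voiced, f0)):
        if is_voiced and value <= threshold:
            run += 1
            if run == min_frames:   # region just became long enough
                opportunities.append(i)
        else:
            run = 0
    return opportunities
```

A real system would combine such a detector with additional constraints, for instance a minimum amount of preceding speech and a refractory period after the last backchannel, so that feedback is not produced too often.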
A first approach was proposed by Cassell et al. (1994). They
implemented a system able to automatically generate and animate
interactions between two or more virtual agents. The system produces
appropriate and synchronized speech, intonation, facial expressions,
and gestures. The speech is computed by a dialogue planner that
generates both the text and the prosody of the sentences. To drive
lip movements, head motion, gaze shifts, facial expressions, and
gestures, the system uses the speaker/listener relationship, the text,
and the turn-conveying intonation. In particular, listener behavior is
generated at utterance boundaries, where silence usually occurs. The
system determines whether there is enough time to produce a
signal by looking at the timing of phonemes and pauses computed
by the text-to-speech synthesizer.
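A minimal sketch of this timing check follows, assuming the synthesizer exposes a list of (label, duration) segments and that each candidate listener signal has a known duration; the data structures are illustrative, not those of Cassell et al.

```python
# Illustrative sketch, not the original system: given segment timings from a
# text-to-speech synthesizer, find the pauses long enough to host a listener
# signal (e.g. a head nod). Durations are in seconds; the (label, duration)
# segment format is an assumption made for this example.

from dataclasses import dataclass

@dataclass
class Segment:
    label: str        # phoneme label, or "pause"
    duration: float   # seconds

def pauses_fitting_signal(segments, signal_duration):
    """Return the start times of pauses long enough to fit the listener signal."""
    start_times, t = [], 0.0
    for seg in segments:
        if seg.label == "pause" and seg.duration >= signal_duration:
            start_times.append(round(t, 3))
        t += seg.duration
    return start_times

# Example: a short utterance followed by a 0.6 s pause; a 0.4 s nod fits into it.
timeline = [Segment("h", 0.08), Segment("e", 0.10), Segment("l", 0.07),
            Segment("ou", 0.15), Segment("pause", 0.6)]
print(pauses_fitting_signal(timeline, signal_duration=0.4))  # one pause at ~0.4 s
```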
Pelachaud (2005) presented Greta, an ECA that exhibits non-verbal
behavior synchronized and consistent with speech. The system can