3.2 Multimodal signals
As described in Section 2.1.1, backchannels are provided not only
through the visual modality, but also through voice, by uttering
paraverbals, words or short sentences (Gardner, 1998; Allwood et
al., 1993). For this reason, such signals must be taken into account
to create credible virtual agents. Bevacqua et al. (2010)
proposed to improve user-agent interaction by introducing multimodal
signals into the backchannels performed by the ECA Greta. Moreover,
they presented a perceptual study aimed at gaining a better
understanding of how multimodal backchannels are interpreted
by users. As in their previous studies (Bevacqua et al., 2007; Heylen
et al., 2007), video clips of a virtual agent performing context-free
multimodal backchannel signals were shown. The participants were
asked to assign none, one or several meanings to each signal. Again,
the meanings proposed were: agreement, disagreement, acceptance,
refusal, interest, no interest, belief, disbelief, understanding, not
understanding, liking, not liking.
To create the videos, seven visual cues (raise eyebrows, nod, smile,
frown, raise left eyebrow, shake and tilt+frown) and eight acoustic
cues (seven vocalizations plus silence: ok, ooh, gosh, really, yeah, no,
m-mh and (silence)) were selected. The visual cues were chosen among
those studied in previous evaluations (Bevacqua et al., 2007; Heylen et
al., 2007), whereas the vocalizations were selected through an informal
listening test (see Bevacqua et al. (2010) for more details). The authors
hypothesized that (i) a multimodal signal created by the combination
of visual and acoustic cues representative of a meaning would obtain
the strongest attribution of that meaning; (ii) the meaning conveyed
by the acoustic and visual cues taken separately can differ from
the meaning transmitted by their combination; and (iii) multimodal
signals obtained by combining visual and acoustic cues that
have strongly opposite meanings would be rated as nonsense (as, for instance,
nod+no, shake+ok, shake+yeah).
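As a rough illustration of how such a stimulus set can be assembled, the sketch below enumerates the cross-product of the visual and acoustic cues listed above and flags the combinations hypothesized to be contradictory in (iii). The cue names are taken from the text; treating the stimuli as the full 7 x 8 cross-product is an assumption made for illustration only, since the study may have used just a subset of the combinations.

```python
from itertools import product

# Cue inventories as listed above; "silence" stands for the vision-only condition.
visual_cues = ["raise eyebrows", "nod", "smile", "frown",
               "raise left eyebrow", "shake", "tilt+frown"]
acoustic_cues = ["ok", "ooh", "gosh", "really", "yeah", "no", "m-mh", "silence"]

# Pairs hypothesized to be contradictory in hypothesis (iii).
hypothesized_nonsense = {("nod", "no"), ("shake", "ok"), ("shake", "yeah")}

# Full cross-product of cues: 7 x 8 = 56 candidate multimodal signals
# (an assumption; the actual stimulus set may be smaller).
signals = [
    {"visual": v, "acoustic": a,
     "expected_nonsense": (v, a) in hypothesized_nonsense}
    for v, a in product(visual_cues, acoustic_cues)
]

print(len(signals))                                   # 56 candidate signals
print(sum(s["expected_nonsense"] for s in signals))   # 3 hypothesized-nonsense pairs
```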
The evaluation was performed in English, and 55 participants accessed
it anonymously through a web browser in which the multimodal signals
were played one at a time. Participants used a bipolar 7-point Likert
scale for each meaning, from -3 (extremely negative attribution) to +3
(extremely positive attribution). Assigning 0 to all dimensions meant
that a participant could not find a meaning among those proposed for the
given signal. Participants could also judge the signal as complete nonsense.
For each meaning, the 95% confidence interval of the ratings was calculated.
Table 2 reports all signals for which the mean was significantly above
zero (for positive meanings) or below zero (for negative meanings).
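As a hedged illustration of this analysis step, the sketch below computes a per-meaning mean and 95% confidence interval and keeps only attributions whose interval excludes zero, mirroring the selection criterion described for Table 2. The use of a t-based interval via SciPy is an assumption about the exact procedure, and the ratings array is invented for the example.

```python
import numpy as np
from scipy import stats

def mean_ci(ratings, confidence=0.95):
    """Mean and 95% confidence interval of Likert ratings (-3..+3) for one meaning."""
    ratings = np.asarray(ratings, dtype=float)
    mean = ratings.mean()
    sem = stats.sem(ratings)                      # standard error of the mean
    half_width = sem * stats.t.ppf((1 + confidence) / 2.0, len(ratings) - 1)
    return mean, mean - half_width, mean + half_width

# Hypothetical ratings for one signal/meaning pair, one value per participant.
ratings = [2, 1, 3, 0, 2, 1, 2, -1, 1, 2]
mean, lo, hi = mean_ci(ratings)

# The attribution is retained only if the interval excludes zero:
# significantly positive -> the meaning is conveyed,
# significantly negative -> the opposite pole is conveyed.
if lo > 0:
    print(f"positive attribution (mean {mean:.2f}, CI [{lo:.2f}, {hi:.2f}])")
elif hi < 0:
    print(f"negative attribution (mean {mean:.2f}, CI [{lo:.2f}, {hi:.2f}])")
else:
    print("no reliable attribution")
```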