performed over the whole word 'samo' ( only ), since the ending of the
PA-syllable also identifies the ending of the word. The NA-syllables are
used to identify the duration of the preparation/retraction movement
phases. The interval of a retraction/preparation phase is delimited
by the beginning of the NA-syllable of the retracting/preparing word
(or word phrase) and by the ending of that word (or word phrase). The retraction/
preparation phases of the sequence are the words 'vedno' ( always ) and
'in' ( and ), and the utterance 'raze'. At the end of each phase, the ECA
will overlay the designated articulated configuration (e.g. pre-stroke
shape, or neutral shape). The temporal-sync deque then temporally
aligns the movement phases with the pronunciation rate of
the phonemes/visemes and the silences. This is implemented using
the proposed equations (1-6) and the predicted temporal information
stored in the Segments layer. Figure 11 shows the absolute
durations and the corresponding relative temporal positions (with regard
to the beginning of the text sequence). For instance, the duration of the
preparation phase of CU-1 is determined to be 0.428 s. It is followed
by a stroke that is determined to last 0.287 s. Finally, the stroke shape
of CU-1 has to be maintained for 0.224 s. The execution of CU-2 has to
be withheld for the duration of CU-1 and starts after 0.939 s. CU-2
then begins with a preparation (0.460 s) and a pre-stroke-hold
(0.080 s). This sequence represents the PI-3 unit, and the shape
overlaid is described as right_B1, Fr|Ce|No|O. In a similar way, the
temporal information is added to all other PI and CU units.
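The sequential scheduling of the movement phases described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the system's actual implementation: the `Phase` class and `schedule` function are hypothetical names, and the durations are the ones quoted in the worked example. The sketch shows why CU-2 must be withheld until 0.939 s, since that is simply the sum of the durations of CU-1's preparation, stroke, and stroke-hold phases.

```python
# Hypothetical sketch: convert absolute phase durations into relative
# start times so that each conversational unit (CU) begins only after
# the preceding phases have finished. Names are illustrative only.

from dataclasses import dataclass

@dataclass
class Phase:
    unit: str        # conversational unit the phase belongs to
    name: str        # movement phase (preparation, stroke, hold, ...)
    duration: float  # absolute duration in seconds

def schedule(phases):
    """Assign each phase a start time relative to the sequence start."""
    timeline, t = [], 0.0
    for p in phases:
        timeline.append((p.unit, p.name, round(t, 3), p.duration))
        t += p.duration
    return timeline

# Durations taken from the worked example in the text.
phases = [
    Phase("CU-1", "preparation", 0.428),
    Phase("CU-1", "stroke", 0.287),
    Phase("CU-1", "stroke-hold", 0.224),
    Phase("CU-2", "preparation", 0.460),
    Phase("CU-2", "pre-stroke-hold", 0.080),
]

for unit, name, start, dur in schedule(phases):
    print(f"{unit} {name}: starts at {start} s, lasts {dur} s")
# CU-2's preparation starts at 0.428 + 0.287 + 0.224 = 0.939 s,
# matching the withholding of CU-2 described in the text.
```

Cumulative summation of this kind is all that is needed once the per-phase durations have been predicted; the harder part, handled by equations (1-6), is deriving those durations from the phoneme/viseme timing in the first place.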
In order to re-create the non-verbal behavior generated by
the system and presented in Figure 11 (conversational shapes), the
system finally generates an EVA-SCRIPT-based behavior description
containing lip-sync and gesture information. This final step
is performed by the non-verbal generator deque. The output is then
synthesized by the conversational agent EVA, as symbolically and
temporally co-aligned communicative behavior. The synthetic coverbal
gestures were also evaluated by staff members and students. They
evaluated lip-sync, the symbolic representations of meaningful words,
and the alignment of coverbal gestures with the synthesized speech.
All of the evaluators agreed that the speech and the visual pronunciation
were in temporal sync; however, 35% of them suggested improving
the correlation between visual and audio stress. Furthermore, 55% of the
observed sequences adequately represented the verbal content, while in
30% of the sequences a meaningful-word mismatch was observed.
Based on verbal information, the evaluators expected another word to
be represented. However, when the meaningful word was suggested to
them, most of them agreed that the representation was adequate, and
appeared more natural. Finally, 15% of the sequences were evaluated