- Q = “follow the street on the right?” (question addressed to the system
after the utterance D and meant to assess the proper understanding of D);
- R = “yes” (system's answer showing that the anaphora was correctly
understood and making the assessment positive).
The authors specified seven levels characterizing the scope of the questions
asked. Let us revisit these levels here, indicating in each case how to extend
the paradigm so that it can be used in a multimodal dialogue.
- Level 1 = explicit information. This is the marking of an explicit piece
of information in the utterance; the point is to test proper understanding
of the literal utterance, given the great variability of spontaneous language.
The examples given by the authors limit themselves to taking up part of the
utterance and asking the user to confirm this part has been understood: D
= “you will take a right after the white buildings with blue shutters” then
Q = “white shutters?” or “blue shutters?” The extension of this principle to
multimodality consists of asking questions about elements of the multimodal
utterance. With D = “put that there” + gesture at (x1, y1) + gesture at (x2, y2),
we can test the system's multimodal tracking abilities by asking the following
questions Q: “that?” + gesture at (x1, y1); “put there?” + gesture at (x2, y2);
“put that?” + gesture at (x2, y2); “put that there?” + gesture at (x2, y2) +
gesture at (x1, y1), etc. The process may appear naive, but it provides a simple
way to test whether the system properly matches gestures with referring expressions,
which is an important part of multimodal fusion. Specific attention must be paid
to the temporal synchronization between the words pronounced and the gestures
produced. Thus, a temporal delay between “that” and the gesture in question Q
could lead, depending on the system, either to a positive answer, which would reflect
its robustness in multimodal matching even when the production conditions
deviate, or, on the contrary, to a negative answer reflecting the system's
inability to go beyond a certain temporal gap.
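To make this matching criterion concrete, here is a minimal sketch in Python. It is purely illustrative: the Gesture and Deictic structures, the match_deictics function and the 0.5 s tolerance are assumptions of ours, not part of the DQR specification. It pairs each deictic word with the temporally closest pointing gesture and rejects the pair when the gap exceeds the tolerance, which mirrors the positive/negative behavior just described.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Gesture:
        x: float
        y: float
        time: float  # timestamp of the pointing gesture, in seconds

    @dataclass
    class Deictic:
        word: str    # e.g. "that", "there"
        time: float  # timestamp at which the word is pronounced

    def match_deictics(deictics: list[Deictic],
                       gestures: list[Gesture],
                       tolerance: float = 0.5) -> dict[str, Optional[Gesture]]:
        """Pair each deictic word with the closest free gesture in time,
        provided the temporal gap stays below `tolerance` (in seconds)."""
        pairing: dict[str, Optional[Gesture]] = {}
        free = list(gestures)
        for d in sorted(deictics, key=lambda d: d.time):
            best = min(free, key=lambda g: abs(g.time - d.time), default=None)
            if best is not None and abs(best.time - d.time) <= tolerance:
                pairing[d.word] = best
                free.remove(best)
            else:
                # gap too large: the corresponding question Q gets a negative answer
                pairing[d.word] = None
        return pairing

    # D = "put that there" + gesture at (x1, y1) + gesture at (x2, y2),
    # with the gesture accompanying "there" deliberately delayed.
    pairs = match_deictics(
        [Deictic("that", 0.8), Deictic("there", 1.6)],
        [Gesture(120, 45, 0.9), Gesture(310, 210, 2.5)],
    )
    print(pairs)  # "that" is matched (0.1 s gap); "there" is not (0.9 s > 0.5 s)

In this toy run, the gesture accompanying “there” arrives 0.9 s late, so the corresponding question Q would receive a negative answer; widening the tolerance would model a system that is more robust to deviating production conditions.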
- Level 2 = implicit information. This level concerns the resolution of
anaphora, ellipses, gaps and other implicit information that can be recovered
at the syntactic and semantic levels. An example would involve: D = “give me
a ticket for Paris and one for Lyon too” and Q = “ticket for Lyon?” Reference
resolution is one of the main aspects of spontaneous multimodality,
so a multimodal DQR will obviously have to account for it. Thus, if we take
D to be the universal primitive of multimodality, “put that there” with two
pointing gestures, the questions Q could introduce further precision about the referents,
starting, for example, from the mention of their category and going so far as to