10.1.3. Oral dialogue methodologies
More specifically for oral MMD, a number of methods have been proposed [ANT 99, DEV 04, DYB 04, WAL 05, MÖL 07, KÜH 12]. Together they form a kind of reference framework containing recommendations for implementing user interaction tests, methods for automatically or semi-automatically analyzing the interaction traces obtained, markers for defining the assessment metrics, and even principles for creating and analyzing the questionnaires filled out by the users. We thus find some of the methods used in MMI. Each system assessor can pick from this stock to determine which method(s) to apply. Indeed, a single test is generally insufficient: a genuine assessment needs to bring several tests together. The evaluation campaigns (Evalda/Media: an assessment methodology for understanding within and outside of the dialogue context), the working groups (the MadCow group, the speech understanding group, GdR I3) and the various European project consortia make wide use of this principle. When several systems are involved and the assessment is comparative, operational rules can be defined so as to better control the quality of the assessment. A challenge evaluation campaign, in which the management of the campaign is crossed with the roles of the designers of the systems involved [ANT 03], is one example.
Each of the methodology's main propositions comes with an original idea meant to simplify the implementation of a given type of test by providing a means to operationalize it in a specific context. The paradigm of the MadCow group [HIR 92] thus provides the notion of a template that characterizes the minimum and maximum answers to a query, thereby making its assessment more rigorous.
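A minimal sketch of this idea, under a set-based reading (the comparison and the example data below are illustrative assumptions, not the MadCow specification itself): an answer is accepted when it covers everything in the minimal reference answer and adds nothing beyond the maximal one.

    # Sketch of template-based answer checking in the spirit of the MadCow
    # minimum/maximum reference answers. The set-based comparison and the
    # example data are illustrative assumptions, not the original format.
    def accept(answer: set, minimal: set, maximal: set) -> bool:
        """Accept if the answer contains at least the minimal reference
        and nothing outside the maximal reference."""
        return minimal <= answer <= maximal

    # Query: "flights from Boston to Denver on Monday morning"
    minimal = {"UA101", "DL202"}            # any correct answer must list these
    maximal = minimal | {"AA303"}           # an answer may list at most these
    print(accept({"UA101", "DL202"}, minimal, maximal))           # True
    print(accept({"UA101"}, minimal, maximal))                    # False: missing item
    print(accept({"UA101", "DL202", "ZZ999"}, minimal, maximal))  # False: extra item

Bounding the answer on both sides is what makes the judgment rigorous: an under-informative answer and an over-generated one are both rejected mechanically, without appeal to a human judge.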
The PARADISE paradigm (PARAdigm for DIalogue System Evaluation) [WAL 01] focuses on maximizing the user's satisfaction and proposes taking satisfaction of the task as its reference.
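The usual PARADISE formulation estimates performance as a weighted task-success term minus a weighted sum of normalized dialogue costs; the sketch below assumes that formulation with illustrative weights and data (in PARADISE proper, the weights are fitted by regressing user satisfaction ratings on these measures).

    # Sketch of a PARADISE-style performance score:
    #   performance = alpha * N(task_success) - sum_i w_i * N(cost_i)
    # where N is z-score normalization. Data and weights are illustrative;
    # PARADISE fits the weights by linear regression against user
    # satisfaction questionnaires.
    from statistics import mean, stdev

    def z_norm(xs):
        """Z-score normalization so heterogeneous measures share one scale."""
        m, s = mean(xs), stdev(xs)
        return [(x - m) / s for x in xs]

    kappa      = [0.9, 0.6, 0.8, 0.4]   # task success per test dialogue
    n_turns    = [12, 20, 15, 30]       # efficiency cost: dialogue length
    asr_errors = [1, 4, 2, 6]           # quality cost: recognition errors

    alpha, w_turns, w_err = 0.5, 0.3, 0.2
    scores = [alpha * k - w_turns * t - w_err * e
              for k, t, e in zip(z_norm(kappa), z_norm(n_turns), z_norm(asr_errors))]
    for i, s in enumerate(scores):
        print(f"dialogue {i}: performance = {s:+.2f}")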
Another example of an original idea is [LÓP 03], which suggests assessing a system by automatically generating test user utterances, that is, by modeling the user's behavior, including his mistakes. In France, this method was taken up in the Simdial paradigm [ALL 07], in which the deterministic simulation of a user makes it possible to automatically assess the system's dialogic abilities, notably thanks to the notion of the disturbing phenomenon which, like the noise in the Wizard of Oz setup of [RIE 11], introduces protests or rephrasing requests that probe the system's general behavior and robustness.
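To make the idea concrete, here is a minimal sketch of a deterministic user simulation with injected disturbing phenomena; the scenario, the phenomena and the driver function are illustrative assumptions, not the actual Simdial implementation.

    # Sketch of a deterministic user simulation in the spirit of Simdial:
    # the scripted user works through a task scenario and, at fixed turns,
    # injects a "disturbing phenomenon" (a protest or a rephrasing request)
    # so that the logged trace shows how the system recovers. The scenario
    # and phenomena are illustrative assumptions.
    SCENARIO = ["book a table", "for two people", "tomorrow at eight"]
    DISTURBANCES = {1: "What do you mean?",             # rephrasing request
                    3: "No, that is not what I said."}  # protest

    def simulate(system_reply, scenario, disturbances, max_turns=10):
        """Drive any system (a callable str -> str) through the script.
        Deterministic: the same scenario always yields the same trace."""
        trace, step = [], 0
        for turn in range(max_turns):
            if turn in disturbances:
                user = disturbances[turn]       # inject the disturbance
            elif step < len(scenario):
                user = scenario[step]           # next scripted utterance
                step += 1
            else:
                break
            trace.append((user, system_reply(user)))
        return trace

    # A trivial echo system stands in for the dialogue system under test.
    for user, system in simulate(lambda u: f"You said: {u}", SCENARIO, DISTURBANCES):
        print(f"USER:   {user}\nSYSTEM: {system}")

Because the simulation is deterministic, two systems can be compared on exactly the same disturbed interaction, which is what makes the resulting robustness assessment reproducible.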
Moreover, the Data-Question-Response (DQR) methodology, see notably the chapter by J. Zeiliger et al. in [MAR 00],