decision policy from an associated Learner module. A decision consists
of a state-action pair: the action being selected, and the evidence
used in selecting that action, which represents the state. Each actor
follows its own action-selection policy, which controls how it explores
its actions; various methods such as ε-greedy exploration, guided
exploration, or confidence value thresholds can be used (Sutton and
Barto, 1998).
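For illustration, the first of these methods, ε-greedy selection over a
table of estimated returns, can be sketched in a few lines of Python
(the action names and values are invented examples, not part of the
system):

    import random

    def epsilon_greedy(q_values, epsilon=0.1):
        """Pick an action: explore a random one with probability
        epsilon, otherwise exploit the highest estimated return."""
        if random.random() < epsilon:
            return random.choice(list(q_values))
        return max(q_values, key=q_values.get)

    # Hypothetical estimated returns for three actions
    q_values = {"take-turn": 0.7, "wait": 0.4, "give-feedback": 0.2}
    print(epsilon_greedy(q_values, epsilon=0.1))

Raising epsilon increases exploration; lowering it makes the actor rely
more on what it has already learnt.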
In our system, the Learner module takes the role of a critic. It
consists of the learning method, reward functions, and the decision
policy being learnt. A Learner monitors the decisions being made in the
system and calculates rewards based on a reward function, a list of
decision/event pairs, and signals from the environment, in our case
overlapping speech and long silences (the environment consists of the
relevant modules in the system). It then publishes an updated decision
policy, on which any actor module can subsequently base its decisions.
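As a sketch of how such decision/event pairs might be matched against a
reward function, consider the following Python fragment (the class
names, signal names, and reward values are illustrative assumptions,
not the system's actual interfaces):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Decision:
        state: tuple   # the evidence used in selecting the action
        action: str    # the action that was selected

    @dataclass
    class RewardRule:
        action: str            # the action the rule applies to
        event: Optional[str]   # environment signal; None = no event occurred
        reward: float          # the reward the rule assigns

    def match_reward(rules, decision, event):
        """Return the reward assigned to this decision/event pair."""
        for rule in rules:
            if rule.action == decision.action and rule.event == event:
                return rule.reward
        return None

    rules = [
        RewardRule("take-turn", "overlapping-speech", -1.0),
        RewardRule("take-turn", None, 1.0),  # clean turn start
    ]
    decision = Decision(state=("falling-pitch",), action="take-turn")
    print(match_reward(rules, decision, "overlapping-speech"))  # -1.0

Here the rule list plays the role of the reward-function configuration
described below.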
We use a delayed one-step Q-Learning method according to the
formula:

    Q(s, a) ← Q(s, a) + α [reward − Q(s, a)]        (3)
where Q(s, a) is the learnt estimated return for picking action a in
state s, and α is the learning rate. The reward functions, specifying
which events following which actions lead to which reward, are
pre-determined in the Learner's configuration in the form of rules: a
reward of x if event y succeeds at action z. Each decision has a
lifetime during which system events can determine a reward, but a
reward can also be calculated in the absence of an event, after the
given lifetime has passed
(e.g. no overlapping speech). Each time an action gets a reward, the
return value is recalculated according to Formula 3 and the Learner
broadcasts the new value.
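Formula 3 translates directly into code. The following Python sketch
applies the update and returns the new value that the Learner would
broadcast (the learning rate of 0.1 and the reward values are
illustrative assumptions):

    def q_update(q_table, state, action, reward, alpha=0.1):
        """Formula 3: Q(s, a) <- Q(s, a) + alpha * (reward - Q(s, a))."""
        key = (state, action)
        old = q_table.get(key, 0.0)
        q_table[key] = old + alpha * (reward - old)
        return q_table[key]  # the new value to broadcast to actors

    q_table = {}
    state, action = ("falling-pitch",), "take-turn"
    # an overlap following the action earns a negative reward ...
    q_update(q_table, state, action, -1.0)
    # ... a clean start, scored after the lifetime passes, a positive one
    print(q_update(q_table, state, action, 1.0))

With a constant learning rate the estimate tracks an exponentially
weighted average of recent rewards rather than converging to a fixed
mean.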
In the current setup, Other-Gives-Turn-Decider-2 (OGTD-2) is an
actor in Sutton's sense (Sutton and Barto, 1998); it essentially decides
what its name implies. This decider is only active in the state I-Want-
Turn. It learns an “optimal” STW that prevents it from speaking on top
of the other party while minimizing the lag in starting to speak, given
a silence. Each time a Speech-Off signal is detected, OGTD-2 receives,
from the Prosody Analyzer, an analysis of the pitch in the last part of
the utterance preceding the silence. The prosody information is then
used to represent the state for the decision; a predicted safe STW is
selected as the action and the Decision is posted. The end of the STW
determines when, in the future, the participant who currently doesn't
have the turn will start speaking (take the turn). In the case where
the interlocutor starts speaking again before this STW closes, the