decision policy from an associated Learner module. A decision consists
of a state-action pair: the action being selected, and the evidence
used in selecting that action, which represents the state. Each actor
follows its own action-selection policy, which controls how it explores
its actions; various methods such as ε-greedy exploration, guided
exploration, or confidence value thresholds can be used (Sutton and
Barto, 1998).
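For illustration, the first of these methods, ε-greedy selection over a
table of estimated returns, can be sketched in a few lines of Python
(the action names and values are invented examples, not part of the
system):

    import random

    def epsilon_greedy(q_values, epsilon=0.1):
        """Pick an action: explore a random one with probability
        epsilon, otherwise exploit the highest estimated return."""
        if random.random() < epsilon:
            return random.choice(list(q_values))
        return max(q_values, key=q_values.get)

    # Hypothetical estimated returns for three actions
    q_values = {"take-turn": 0.7, "wait": 0.4, "give-feedback": 0.2}
    print(epsilon_greedy(q_values, epsilon=0.1))

Raising epsilon increases exploration; lowering it makes the actor rely
more on what it has already learnt.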
In our system, the Learner module takes the role of a critic. It
consists of the learning method, reward functions, and the decision
policy being learnt. A Learner monitors the decisions being made in the
system and calculates rewards based on a reward function, a list of
decision/event pairs, and signals from the environment, in our case
overlapping speech and long silences (the environment consists of the
relevant modules in the system). It then publishes an updated decision
policy, on which any actor module can subsequently base its decisions.
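As a sketch of how such decision/event pairs might be matched against a
reward function, consider the following Python fragment (the class
names, signal names, and reward values are illustrative assumptions,
not the system's actual interfaces):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Decision:
        state: tuple   # the evidence used in selecting the action
        action: str    # the action that was selected

    @dataclass
    class RewardRule:
        action: str            # the action the rule applies to
        event: Optional[str]   # environment signal; None = no event occurred
        reward: float          # the reward the rule assigns

    def match_reward(rules, decision, event):
        """Return the reward assigned to this decision/event pair."""
        for rule in rules:
            if rule.action == decision.action and rule.event == event:
                return rule.reward
        return None

    rules = [
        RewardRule("take-turn", "overlapping-speech", -1.0),
        RewardRule("take-turn", None, 1.0),  # clean turn start
    ]
    decision = Decision(state=("falling-pitch",), action="take-turn")
    print(match_reward(rules, decision, "overlapping-speech"))  # -1.0

Here the rule list plays the role of the reward-function configuration
described below.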
We use a delayed one-step Q-Learning method according to the
formula:

    Q(s, a) ← Q(s, a) + α [reward − Q(s, a)]        (3)
where Q(s, a) is the learnt estimated return for picking action a in
state s, and α is the learning rate. The reward functions, specifying
which events following which actions lead to which reward, are
pre-determined in the Learner's configuration in the form of rules: a
reward of x if event y succeeds at action z. Each decision has a
lifetime during which system events can determine a reward, but a
reward can also be calculated in the absence of an event, after the
given lifetime has passed
(e.g. no overlapping speech). Each time an action gets a reward, the
return value is recalculated according to Formula 3 and the Learner
broadcasts the new value.
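Formula 3 translates directly into code. The following Python sketch
applies the update and returns the new value that the Learner would
broadcast (the learning rate of 0.1 and the reward values are
illustrative assumptions):

    def q_update(q_table, state, action, reward, alpha=0.1):
        """Formula 3: Q(s, a) <- Q(s, a) + alpha * (reward - Q(s, a))."""
        key = (state, action)
        old = q_table.get(key, 0.0)
        q_table[key] = old + alpha * (reward - old)
        return q_table[key]  # the new value to broadcast to actors

    q_table = {}
    state, action = ("falling-pitch",), "take-turn"
    # an overlap following the action earns a negative reward ...
    q_update(q_table, state, action, -1.0)
    # ... a clean start, scored after the lifetime passes, a positive one
    print(q_update(q_table, state, action, 1.0))

With a constant learning rate the estimate tracks an exponentially
weighted average of recent rewards rather than converging to a fixed
mean.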
In the current setup, Other-Gives-Turn-Decider-2 (OGTD-2) is an
actor in Sutton's sense (Sutton and Barto, 1998); it essentially decides
what its name implies. This decider is only active in the state I-Want-
Turn. It learns an “optimal” STW that prevents it from speaking on top
of the other party while minimizing the lag in starting to speak, given
a silence. Each time a Speech-Off signal is detected, OGTD-2 receives,
from the Prosody Analyzer, an analysis of the pitch in the last part of
the utterance preceding the silence. The prosody information is then
used to represent the state for the decision; a predicted safe STW is
selected as the action and the Decision is posted. The end of the STW
determines when, in the future, the participant who currently doesn't
have the turn will start speaking (take the turn). In the case where
the interlocutor starts speaking again before this STW closes, the