2. The system interviewing a single person (“Single person”). A
single person should behave fairly consistently, but some external
noise is inevitable because communication takes place over Skype.
Significant results with a single person would show that the system
can adapt using a very small set of training data, a highly desirable
feature for such systems.
3. The system interviewing 10 people (“10 people”). This is the most
complex condition, as there is both individual variation between
participants and background noise. Individual variation could be a
confounding factor; significant results in this condition would show
that the system is robust to individual variation. Improvement over
time would indicate that the system can learn from, and in spite of,
individual differences.
In all conditions, the system is learning to take turns in a “polite,”
cooperative manner while striving for the shortest possible silence
between turns. Each evaluation consists of 10 consecutive interviews.
Our learning system, named Askur for convenience, begins the first
interview with no knowledge, and gradually adapts to its interlocutors
throughout the 10 interview sessions.
The goal of the learning system is to learn to take turns with no
speech overlap while achieving the shortest possible silence between
speaker turns. To eliminate variations in STW caused by the agent
having nothing to say, we chose an interview scenario in which the
learning agent is the interviewer; it therefore always has something
to say (until it runs out of questions and the interview is over).
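The learning objective above (no overlap, minimal silence, knowledge carried across 10 consecutive interviews) can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not Askur's algorithm, which this section does not describe: it swaps in a simple epsilon-greedy bandit that chooses how long to wait after a turn-giving cue. The candidate waits (`CANDIDATE_WAITS_MS`), the overlap cost (`OVERLAP_PENALTY`), the uniform 120-180 ms "hold" time during which a simulated interviewee might still continue, and the helper names `turn_cost`, `WaitLearner`, and `run_interviews` are all hypothetical.

```python
import random
from dataclasses import dataclass, field

# Hypothetical values chosen for illustration only.
CANDIDATE_WAITS_MS = [50, 100, 200, 300, 500, 800]
OVERLAP_PENALTY = 2000.0  # assumed cost of speaking while the interviewee is still talking


def turn_cost(wait_ms: float, hold_ms: float) -> float:
    """Cost of one turn transition: heavy penalty on overlap, else the silence duration."""
    if wait_ms < hold_ms:
        return OVERLAP_PENALTY  # agent started before the interviewee had finished
    return float(wait_ms)  # silence between the turn-giving cue and the agent's reply


@dataclass
class WaitLearner:
    """Epsilon-greedy bandit over candidate wait times (a stand-in, not Askur's method)."""
    waits_ms: list
    epsilon: float = 0.1
    counts: dict = field(default_factory=dict)
    mean_cost: dict = field(default_factory=dict)

    def choose(self, rng: random.Random) -> int:
        # Starting with no knowledge: try every candidate once before exploiting.
        for w in self.waits_ms:
            if self.counts.get(w, 0) == 0:
                return w
        if rng.random() < self.epsilon:
            return rng.choice(self.waits_ms)  # explore
        return min(self.waits_ms, key=lambda w: self.mean_cost[w])  # exploit lowest mean cost

    def update(self, wait_ms: int, cost: float) -> None:
        # Incremental running mean of the observed cost for this wait time.
        n = self.counts.get(wait_ms, 0) + 1
        self.counts[wait_ms] = n
        prev = self.mean_cost.get(wait_ms, 0.0)
        self.mean_cost[wait_ms] = prev + (cost - prev) / n


def run_interviews(n_sessions: int = 10, turns_per_session: int = 20, seed: int = 0):
    """Simulate consecutive interviews; the learner's statistics carry over between sessions."""
    rng = random.Random(seed)
    learner = WaitLearner(CANDIDATE_WAITS_MS)
    session_means = []
    for _ in range(n_sessions):
        total = 0.0
        for _ in range(turns_per_session):
            hold_ms = rng.uniform(120, 180)  # assumed time the interviewee may keep talking
            wait_ms = learner.choose(rng)
            cost = turn_cost(wait_ms, hold_ms)
            learner.update(wait_ms, cost)
            total += cost
        session_means.append(total / turns_per_session)
    return learner, session_means


if __name__ == "__main__":
    learner, session_means = run_interviews()
    best = min(learner.mean_cost, key=learner.mean_cost.get)
    print("learned wait (ms):", best)
    print("mean cost per session:", [round(m) for m in session_means])
```

With these assumed hold times the learner's greedy choice settles on 200 ms, the shortest candidate wait that never overlaps. Charging silence by its duration and overlap with a large fixed penalty is one simple way to encode "no overlap, yet the shortest possible silence" as a single quantity to minimize; with real interviewees, hold times vary by person and by noise, which is what makes the “10 people” condition harder.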
We are aiming for an agent that can adapt its turn-taking behavior
to a dialogue in a short amount of time, using incremental perception.
In the evaluations we focus exclusively on detecting turn-giving
indicators in deliberately generated prosody, leaving aside
turn-opportunity detection (i.e., turn transitions without prior
indication from the speaker that she is giving the turn), which would,
for example, be necessary for producing human-like interruptions.
A sample of 11 Icelandic volunteers took part in the experiment,
none of whom had interacted with the system before. All subjects
spoke English to the agent with varying degrees of Icelandic prosody,
which differs from the prosody of native English speakers. The study
was done in a partially controlled setup: all subjects interacted with
the system through Skype using the same hardware (computer,
microphone, etc.), but the location was only semi-private and some
background noise was present in all cases.