A Distributed Architecture for Real-time Dialogue and On-task Learning of Efficient Co-operative Turn-taking - Coverbal Synchrony in Human-Machine Interaction

Similar to Thórisson (2002a), pitch is further analyzed by a Prosody Analyzer perception module to compute a more compact representation of the pitch pattern in a discrete state space, in our case to support the learning. The most recent tail of speech right before a silence, the last 300 msecs, is analyzed to detect the minimum and maximum values of the fundamental pitch and thereby produce a tail-slope pattern of the pitch. Slope is split into semantic categories; in the present implementation we have used three categories for slope (Up, Straight and Down, according to Formula 1) and three for the relative value of pitch right before silence (Above, At and Below, as compared to the average pitch, according to Formula 2).

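\[
m = \frac{\Delta\,\mathit{pitch}}{\Delta\,\mathit{msecs}}, \qquad
\begin{cases}
\text{if } m > 0.05 & \rightarrow\ \mathit{slope} = \mathit{Up} \\
\text{if } (-0.05 \le m \le 0.05) & \rightarrow\ \mathit{slope} = \mathit{Straight} \\
\text{if } m < -0.05 & \rightarrow\ \mathit{slope} = \mathit{Down}
\end{cases}
\tag{1}
\]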

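\[
d = \mathit{pitch}_{end} - \mathit{pitch}_{avg}, \qquad
\begin{cases}
\text{if } d > Pt & \rightarrow\ \mathit{end} = \mathit{Above} \\
\text{if } (-Pt \le d \le Pt) & \rightarrow\ \mathit{end} = \mathit{At} \\
\text{if } d < -Pt & \rightarrow\ \mathit{end} = \mathit{Below}
\end{cases}
\tag{2}
\]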

where Pt is a tolerance of 10 around the average (average ± 10), i.e. the pitch average with a bit of room for deviation.
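As an illustration only, the two formulas can be sketched in Python. The thresholds 0.05 and 10 and the 300 msec tail come from the text; everything else is an assumption made for this sketch: the function names, the units (Hz and msecs), the input format of (time, F0) samples, and in particular the reading that the slope is taken between the minimum and maximum pitch values found in the tail.

```python
from typing import List, Tuple

SLOPE_THRESHOLD = 0.05  # from Formula 1; units assumed to be Hz per msec
PITCH_TOLERANCE = 10.0  # Pt from the text; units assumed to be Hz
TAIL_MSECS = 300.0      # length of the analyzed tail before a silence

Sample = Tuple[float, float]  # hypothetical input: (time in msecs, F0 in Hz)


def classify_slope(m: float) -> str:
    """Formula 1: map the tail slope m onto Up, Straight or Down."""
    if m > SLOPE_THRESHOLD:
        return "Up"
    if m < -SLOPE_THRESHOLD:
        return "Down"
    return "Straight"


def classify_end(pitch_end: float, pitch_avg: float) -> str:
    """Formula 2: compare the pitch right before silence with the average."""
    d = pitch_end - pitch_avg
    if d > PITCH_TOLERANCE:
        return "Above"
    if d < -PITCH_TOLERANCE:
        return "Below"
    return "At"


def tail_slope(tail: List[Sample]) -> float:
    """Slope between the pitch minimum and maximum of the tail (assumed reading)."""
    lo = min(tail, key=lambda s: s[1])
    hi = max(tail, key=lambda s: s[1])
    first, last = sorted((lo, hi), key=lambda s: s[0])  # order the extrema in time
    dt = last[0] - first[0]
    return 0.0 if dt == 0 else (last[1] - first[1]) / dt


def analyze_tail(samples: List[Sample], pitch_avg: float) -> Tuple[str, str]:
    """Return (slope category, end category) for the last 300 msecs of speech."""
    end_time = samples[-1][0]
    tail = [s for s in samples if s[0] >= end_time - TAIL_MSECS]
    return classify_slope(tail_slope(tail)), classify_end(tail[-1][1], pitch_avg)
```

For example, a tail rising from 100 to 130 over 300 msecs against an average of 110 gives m = 0.1 and d = 20, so analyze_tail([(0, 100), (100, 110), (200, 120), (300, 130)], 110) returns ("Up", "Above").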

The primary output of the Prosody Analyzer is a symbolic representation of the particular prosody pattern identified in this tail period (see Figure 3). More features could be added into the symbolic representation, with the obvious side effect of increasing the state space.
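The growth of the state space can be made concrete with a small sketch. The slope and end categories are those named in the text; the helper state_space and the extra feature DURATIONS are hypothetical and serve only to show how the number of states multiplies.

```python
from itertools import product

SLOPES = ("Up", "Straight", "Down")  # slope categories from the text
ENDS = ("Above", "At", "Below")      # end-position categories from the text


def state_space(*features):
    """Enumerate every combination of the given categorical features."""
    return list(product(*features))


# The two features described in the text give 3 x 3 = 9 symbolic states.
states = state_space(SLOPES, ENDS)

# Hypothetical third feature, not part of the described system.
DURATIONS = ("Short", "Long")
bigger = state_space(SLOPES, ENDS, DURATIONS)  # 3 x 3 x 2 = 18 states
```

Each added feature multiplies the number of states by its number of categories, which is the cost the text refers to: a learner has more states to visit before its estimates become reliable.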

The Speech-To-Text module and Text Analyzers deal with speech recognition. Speech recognition is done incrementally with the best

Figure 3. A window of 9 seconds of spontaneous speech, which includes speech periods and silences, categorized into descriptive groups for slope and end position relative to the average pitch. Only the slope of the fundamental pitch during the 300 msecs immediately preceding a silence (indicated by the gray area) is categorized (into Up, Straight, and Down). (Ordinate: voice F0 in Hz, as produced in near real-time by Prosodica; abscissa: time in hours/minutes/seconds.)
