Speech Technology and Conversational Activity in Human-Machine Interaction - Coverbal Synchrony in Human-Machine Interaction


in spite of what is now a high level of technical competence in both

synthesis and recognition components.

3. Use of 'Social Prosody' in a Conversation

Much of the social or interpersonal information in speech is carried

by the prosody and signaled by changes in the intonation, loudness,

rhythm, and tone-of-voice of the speaker. It is also carried by the

numerous backchannel utterances that intersperse a conversation

to show listener feedback. In our JST/ESP corpus of 1500 hours of

everyday conversational interaction (www.speech-data.jp), these

short nonverbal interjections accounted for more than half of the total

number of utterances. These words, like 'ah' and 'um', 'yeah' and

'yeah yeah yeah', are characterized both by phonetic simplicity and

prosodic complexity, perhaps serving principally to carry tone-of-voice

information, simultaneously signaling both speaker affect and relation

to the interlocutor (Campbell and Mokhtari, 2003).

The study of speech prosody has a long history, but much of the

science to date has focused on the relations between the intonation

of syntactic elements in a sentence, i.e. on linguistic content. More

recently, however, the social aspects of spoken interaction have begun

to be studied from a prosodic point of view, and it has been shown that

prosody functions not only to signal the structure and relationships

of morphological, syntactic, and semantic aspects of propositional

content but also simultaneously serves as a messenger for affective

and cognitive information related to speaker participation status in a

discourse and inter-participant relationships (Campbell, 2007).

Traditional studies have been based on read speech. Read speech

and broadcast speech stand in contrast to conversational speech in

that they function primarily to convey text-based information to an

audience that is largely passive. They are one-way processes, as is

present-day dialogue technology. In the case of radio broadcasts, for

example, no real-time feedback from the audience is even possible and

the speaker has no need to take any observable cognitive states of the

listener into account when rendering text as speech. The task is simply

to render the content intelligibly (content that was originally created as

text, and which through its complexity presents a particular prosodic

challenge to the broadcaster, who is therefore usually a highly trained

performer). In the case of a public lecture, however, the audience

may be visible and, although it listens passively with no right to speak,

an effective presenter will take into account such cues as small head

movements and facial expression changes that signal understanding,
