that the response (c) involving multimodal information is both fast
and effective. The recognition load is also simplified.
5. Multimodal Discourse and Dialogue Tracking
Human beings have proven to be remarkably efficient at using every
modality available when communicating, and dialogue has naturally
evolved as a multimodal activity. It is a logical consequence that speech
technology should evolve in the direction of multimodality. Whereas
the telephone is evidence that speech can function on its own, without
any need to see the interlocutor, it is clear that most people
do take advantage of the visual channel when talking to each other.
People even gesture when talking on the telephone!
Cameras are now ubiquitous, and computer memory and CPU
speeds are fast enough to enable real-time video and speech processing
on a phone, a tablet, or a notebook computer, as Siri
and Google Talk demonstrate, so we should now begin to model the integration
of these several data streams for dialogue processing. We will look
next at the types of information that can be relatively simply extracted
by a combination of camera and microphone and examine how they
might be used to process speech interactivity.
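The kinds of features involved can be surprisingly simple to compute. The following Python sketch shows the two per-participant streams discussed below: an energy-based estimate of speech activity from the microphone, and a vertical head-movement trace from tracked face positions. The frame size, the threshold, and the face tracker supplying face_centres are illustrative assumptions, not details of the original system.

import numpy as np

def speech_activity(samples, sr, frame_ms=20, threshold_db=-35.0):
    """Return one boolean per frame: True where frame energy exceeds the threshold."""
    hop = int(sr * frame_ms / 1000)
    n = len(samples) // hop
    frames = samples[: n * hop].reshape(n, hop)
    rms = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-12)
    return 20 * np.log10(rms) > threshold_db

def vertical_head_movement(face_centres):
    """Frame-to-frame change in the vertical (y) coordinate of a tracked face.

    face_centres is an (n_frames, 2) array of (x, y) positions produced by
    any face tracker; only the y-axis delta is kept, matching the figure.
    """
    return np.diff(face_centres[:, 1], prepend=face_centres[0, 1])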
Figure 1 plots six minutes of audiovisual data extracted from a
30-minute conversation between four people sitting round a table,
casually chatting. The figure shows both camera and microphone data
aligned, with speech activity represented as bars of color, overlaid
with graphs displaying, in this case, vertical head movement for each
participant (Campbell and Douxchamps, 2007). Each row of the plot
shows 60 seconds of activity, taken from the 5th to the 10th minutes of
Day One in the FreeTalk corpus (SSPNet 2010 - FreeTalk).
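A row of such a plot can be reproduced with a few lines of matplotlib, assuming the speech-activity and head-movement arrays come from the sketch above at a common frame rate. The band layout, colors, and scaling are hypothetical choices, not the original plotting code.

import numpy as np
import matplotlib.pyplot as plt

def plot_minute(active, head_y, fps=50):
    """active: (4, n) boolean speech activity; head_y: (4, n) movement traces."""
    t = np.arange(active.shape[1]) / fps
    fig, ax = plt.subplots(figsize=(12, 3))
    for i in range(4):
        # Bar of color wherever participant i is speaking.
        ax.fill_between(t, i, i + 0.8, where=active[i],
                        color=f"C{i}", alpha=0.4, step="pre")
        # Head-movement trace, scaled to fit inside the participant's band.
        trace = head_y[i] / (np.abs(head_y[i]).max() + 1e-9)
        ax.plot(t, i + 0.4 + 0.35 * trace, color=f"C{i}", lw=0.8)
    ax.set_yticks([0.4, 1.4, 2.4, 3.4])
    ax.set_yticklabels(["P1", "P2", "P3", "P4"])
    ax.set_xlabel("time (s)")
    plt.show()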
The plot allows us to visualize the discourse activity without
knowing exactly what is being said, which puts us in the position of a
machine faced with the task of processing human dialogue. However,
it is clear from the plot that much of the activity can be parsed
without any knowledge of the actual underlying speech. From such an
estimation of the function of each utterance in the overall activity, we
can make a better guess at how the dialogue should be processed. For
example, laughter or backchannel utterances might be better treated by
prosodic or voice quality analysis alone, while longer (propositional?)
utterance sequences would need to be processed by speech recognition
and semantic analysis. A glance at the figure above shows that these
two types are readily distinguishable within the overall structure, as
are the dynamics of the discourse.
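A first approximation to this routing decision can be expressed directly in code. In the sketch below, utterance duration alone stands in for the backchannel/propositional distinction; the one-second boundary and the handler names are illustrative assumptions, and a real system would also consult prosodic and visual cues.

from dataclasses import dataclass

@dataclass
class Utterance:
    speaker: str
    start: float  # seconds
    end: float    # seconds

def route(utt, short_max_s=1.0):
    """Choose a processing path for a detected utterance segment."""
    if utt.end - utt.start <= short_max_s:
        # Short bursts (laughter, backchannels) go to prosodic and
        # voice-quality analysis rather than full recognition.
        return "prosody"
    # Longer, likely propositional stretches warrant speech
    # recognition followed by semantic analysis.
    return "asr+semantics"

For example, route(Utterance("P2", 12.3, 12.7)) returns "prosody".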