versions of the compressed music they download, but typically they do
not—the marketing model has also changed—and multiple low-price
medium-quality purchases, each well below the threshold of pain, have
replaced the single high-value high-quality purchase model of the past.
This change in lifestyle has parallels in speech processing, where
multiple low-quality utterances might be found more often than well-
formed utterances that are closer to text in form. Anyone who has spent
time reading through transcriptions of spontaneous conversations will
recognize that very few of the chunks of speech form well-formed
sentences (or even well-formed phrases) and that this fragmentation
is a defining characteristic of spontaneous interactive two-way social
communication.
In considering improved technologies for a future generation
of spoken dialogue systems, we propose therefore that instead of
one high-quality 'speech → text → speech' transfer, we might prefer
a sequence of multiple low-complexity, low-quality, high-frequency
'niblets'. For example, the sequence “you ... me ... friends ...
okay” works well as a functional utterance in the real world
despite its obvious fragmentation (it consists of four 'niblets') and
its ungrammaticality. In this example, the word 'friends' might be
replaced by the longer phrase 'work well together' without affecting
the niblet count (the processing load). This chunking and collocation
of speech fragments closely resembles the broken and interrupted
forms that are common in interactive and conversational speech, as
illustrated by the Seymour Hersh extract above.
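The niblet count described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that niblet boundaries are marked by an ellipsis in a transcription, as in the example utterance; the function name `split_niblets` is hypothetical, and a real system would presumably detect boundaries from pauses or prosody rather than from punctuation.

```python
import re


def split_niblets(utterance: str) -> list[str]:
    """Split a transcribed utterance into niblets.

    Assumption: boundaries are written as '...' or the single
    character '…' in the transcription.
    """
    parts = re.split(r"\s*(?:\.{3}|…)\s*", utterance.strip())
    # Drop empty strings produced by leading or trailing boundary marks.
    return [p for p in parts if p]


short = split_niblets("you ... me ... friends ... okay")
longer = split_niblets("you ... me ... work well together ... okay")

# Both utterances have the same niblet count, although the second
# contains more words.
print(len(short), len(longer))
```

Under this assumption the unit of processing is the chunk between boundaries, not the word, so replacing 'friends' with 'work well together' leaves the count unchanged.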
To explain why this type of supposedly ill-formed utterance
processing might be beneficial to future dialogue systems, consider
the practical case of the user manual that comes with a
digital camera purchase. It may contain several hundred pages
of information, often in several languages, but at any given time the
user typically needs to access only one small part of it. If the manual
is available online, or provided digitally as part of the camera,
and a speech-enabled graphical interface is to provide the relevant
information, then (a) the request “How do I turn off the flash?” might
be rendered more efficiently in niblets by the customer as (b) “nikon ...
s6000 ... flash ... off”, and the response from the system might be a combination
of an image, an animated gesture, and a few niblets of its own: (c)
“push here ... top part” (two niblets) delivered as speech through a
multimodal interface.
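The camera-manual exchange can be sketched as a simple keyword lookup. This is only one possible reading of the scenario, not a method given in the text: the index `MANUAL_ENTRIES`, its contents, the image file names, and the function names are all hypothetical, and the ellipsis again stands in for whatever boundary detection a real system would use.

```python
import re

# Hypothetical manual index: each entry pairs a set of keywords with a
# spoken response (a list of niblets) and an image to display.
# All entries and file names are invented for illustration.
MANUAL_ENTRIES = [
    {
        "keywords": {"nikon", "s6000", "flash", "off"},
        "speech": ["push here", "top part"],
        "image": "flash_button.png",
    },
    {
        "keywords": {"nikon", "s6000", "battery", "charge"},
        "speech": ["open here", "plug in"],
        "image": "battery_cover.png",
    },
]


def split_niblets(utterance: str) -> list[str]:
    """Split on '...' or '…' (assumed boundary marks) and lowercase."""
    parts = re.split(r"\s*(?:\.{3}|…)\s*", utterance.strip().lower())
    return [p for p in parts if p]


def lookup(utterance: str, entries=MANUAL_ENTRIES):
    """Return the entry sharing the most niblets with the request,
    or None if no entry shares any."""
    niblets = set(split_niblets(utterance))
    best = max(entries, key=lambda e: len(niblets & e["keywords"]))
    return best if niblets & best["keywords"] else None


answer = lookup("nikon ... s6000 ... flash ... off")
if answer is not None:
    # The spoken part of the response is itself a short niblet sequence.
    print(" ... ".join(answer["speech"]), "| show:", answer["image"])
```

No parsing of request (b) is needed in this sketch: each niblet acts directly as a search key, which is one way to read the claim that the fragmented form is cheaper to process than the grammatical request (a).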
The key point being made here is that rather than a well-formed,
grammatical text-like utterance (a), the fragmented version (b) carries
equivalent information in a more efficient (if less elegant) manner, and