Digital Signal Processing Reference
These represent the typical scenarios taken into consideration by the VICO
project.
During recordings, a co-driver was always in the car to describe each goal
the driver had to pursue by interacting with the system by voice. The wizard
was at the ITC-irst labs, connected to the mobile phone of the car. A specific
setup was designed to make the simulated interaction as realistic as possible
and to allow synchronous speech acquisition through two input channels, one
connected to a close-talk head-mounted microphone (denoted as “CT”) and
the other to a far microphone placed on the ceiling (denoted as “Far”). The
audio prompts were produced using a commercial text-to-speech
synthesizer.
The present release includes 16 speakers (8 male, 8 female), who
uttered a total of 1612 spontaneous speech utterances (equivalent to 9150
word occurrences). The total duration of the speech corpus is 132 minutes
(the mean utterance duration is 4.9 s) and the total vocabulary size is 918
words.
Note that all of the speakers were new to this type of system and that the
wizard's behavior was based on a previously defined interaction model that
included the simulation of recognition errors typical of the foreseen real
scenario. As a result, many sentences include typical spontaneous speech
problems (e.g. hesitations, repetitions, false starts, wrong pronunciations)
and often consist of many words (in a few cases the input utterance
contained more than 25 words). The realism of the experiment is also shown
by the fact that at the end of the experiment, after more than one hour, all
the speakers declared they were not aware that a human was interacting with
them.
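As a quick consistency check on the corpus figures quoted above, the short Python sketch below recomputes the mean utterance duration from the total duration and the utterance count, and also derives the mean number of words per utterance. The second figure is not stated in the text; it is simple arithmetic on the quoted totals. The constant names are chosen here for illustration only.

```python
# Figures quoted in the text for the present release of the corpus.
NUM_UTTERANCES = 1612
NUM_WORD_OCCURRENCES = 9150
TOTAL_DURATION_MIN = 132

# Mean utterance duration in seconds: total duration divided by utterance count.
mean_utterance_sec = TOTAL_DURATION_MIN * 60 / NUM_UTTERANCES

# Mean utterance length in words (derived here, not reported in the text).
mean_words_per_utterance = NUM_WORD_OCCURRENCES / NUM_UTTERANCES

print(f"mean utterance duration: {mean_utterance_sec:.1f} s")
print(f"mean words per utterance: {mean_words_per_utterance:.1f}")
```

The recomputed mean duration rounds to the 4.9 s reported above, so the quoted totals are mutually consistent.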
3.2 Recognition experiments
The present architecture is based on parallel recognizers covering distinct
application domains and/or geographical clusters. The baseline performance,
shown in Table 6-1, is evaluated using a single class-based language model
(LM), trained on a corpus of about 3000 sentences that cover different
application domains such as navigation, hotel reservation, address book
management, and questions about the car. The geographic coverage of this LM,
indicated by the suffix Cgl, is the whole Trentino province, including names
of cities, streets, hotels, restaurants, and points of interest (POIs) such
as churches, castles, and museums. Equal probability has
been assigned to all the items within each geographical class. The derived LM
includes about 12000 words and has an out-of-vocabulary (OOV) rate
(evaluated on the WOZ data) of 1.1%.
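The two ideas in this paragraph, equal probability for all items within a class and the OOV rate of a test set, can be illustrated with a minimal Python sketch. It is not the system's actual implementation: the function names, the toy class contents and the toy test sentence are all hypothetical, and the OOV rate is assumed to follow the usual token-based definition (the share of running words in the test data that are missing from the LM vocabulary), which the text does not spell out.

```python
def uniform_class_probs(class_members):
    """Give every distinct item in a class the same probability 1/|class|."""
    probs = {}
    for cls, items in class_members.items():
        distinct = set(items)
        probs[cls] = {item: 1.0 / len(distinct) for item in distinct}
    return probs


def word_prob(word, cls, class_prob, in_class_probs):
    """Class-based LM factorization: P(word | h) = P(class | h) * P(word | class)."""
    return class_prob * in_class_probs[cls].get(word, 0.0)


def oov_rate(tokens, vocabulary):
    """Percentage of running word tokens that are not in the vocabulary."""
    if not tokens:
        return 0.0
    missing = sum(1 for token in tokens if token not in vocabulary)
    return 100.0 * missing / len(tokens)


# Toy data, invented for illustration only.
classes = {"CITY": ["trento", "rovereto", "arco", "pergine"]}
in_class = uniform_class_probs(classes)

# Assume the n-gram over classes assigns P(CITY | history) = 0.1.
p_trento = word_prob("trento", "CITY", 0.1, in_class)

vocabulary = {"take", "me", "to"} | set(classes["CITY"])
test_tokens = "take me to trento please".split()
rate = oov_rate(test_tokens, vocabulary)

print(f"P(trento | history) = {p_trento:.3f}")
print(f"OOV rate = {rate:.1f} %")
```

With uniform in-class probabilities, adding an item to a class only requires extending the class list; the class-level n-gram does not have to be retrained, at the cost of ignoring how often each item is actually requested.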