USE OF MULTIPLE SPEECH RECOGNITION UNITS IN AN IN-CAR ASSISTANCE SYSTEM - DSP for In-Vehicle and Mobile Systems


These represent the typical scenarios taken into consideration by the VICO

project.

During the recordings, a co-driver was always in the car to describe each goal

the driver had to pursue through voice interaction with the system. The wizard was

at ITC-irst labs, connected to the mobile phone of the car. A specific setup

was designed in order to simulate an interaction as realistic as possible and to

allow a synchronous speech acquisition through two input channels, one

connected to a close-talk head-mounted microphone (denoted as “CT”) and

the other to a far microphone placed on the ceiling (denoted as “Far”). The

audio prompts were produced using a commercial text-to-speech

synthesizer.

The present release includes 16 speakers (8 male and 8 female) who

uttered a total of 1612 spontaneous speech utterances (equivalent to 9150

word occurrences). The total duration of the speech corpus is 132 minutes (mean

utterance duration 4.9 s), and the total vocabulary size is 918 words.
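
The averages implied by these corpus figures can be checked with a few lines of arithmetic. The sketch below uses only the totals quoted above; the variable names are illustrative, and the mean utterance length in words is derived here rather than stated in the text.

```python
# Totals quoted in the text for the present WOZ release.
n_utterances = 1612      # spontaneous speech utterances
n_words = 9150           # word occurrences
total_minutes = 132      # total corpus duration

# Mean utterance duration in seconds (the text reports 4.9 s).
mean_duration_sec = total_minutes * 60 / n_utterances

# Mean utterance length in words (derived, not quoted in the text).
mean_words_per_utt = n_words / n_utterances

print(f"mean utterance duration: {mean_duration_sec:.1f} s")
print(f"mean utterance length:   {mean_words_per_utt:.1f} words")
```

The computed mean duration agrees with the quoted value, and the derived figure of fewer than six words per utterance on average shows how unusual the inputs of more than 25 words are.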

Note that all of the speakers were new to this type of system

and that the wizard's behavior followed a previously defined interaction model

that included simulated recognition errors typical of the

foreseen real scenario. As a result, many sentences include typical

spontaneous speech problems (hesitations, repetitions, false starts, wrong

pronunciations, etc.) and often consist of many words (in a few cases the input

utterance contained more than 25 words). The realism of the experiment is

also shown by the fact that at the end of the experiment, after more than one

hour, all the speakers declared they were not aware of the fact that a human

was interacting with them.

3.2 Recognition experiments

The present architecture is based on parallel recognizers covering distinct

application domains and/or geographical clusters. The baseline performance,

shown in Table 6-1, is evaluated using a single class-based language model,

trained on a corpus of about 3000 sentences that cover different application

domains such as navigation, hotel reservation, address book management, and

questions about the car. The geographic coverage of this LM, indicated by the

suffix Cgl, is the whole Trentino province, including names of cities, streets,

hotels, restaurants, and POIs (points of interest such as churches, castles, museums). Equal probability has

been assigned to all the items within each geographical class. The derived LM

includes about 12000 words and has an out-of-vocabulary (OOV) rate

(evaluated on the WOZ data) of 1.1%.
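
Two of the quantities in this paragraph lend themselves to a small illustration: the uniform probability assigned to items within a class, and the OOV rate measured on test data. In a class-based LM the probability of a word is commonly factored as P(class | history) × P(word | class); with equal probabilities, P(word | class) is simply 1 divided by the class size. The following is a minimal sketch under that assumption, not the system's actual implementation. The function names, class labels, and toy vocabulary are all hypothetical.

```python
def uniform_class_probs(class_members):
    """Assign P(word | class) = 1 / |class| to every distinct member."""
    probs = {}
    for cls, members in class_members.items():
        unique = set(members)  # guard against duplicated entries
        probs[cls] = {w: 1.0 / len(unique) for w in unique}
    return probs


def oov_rate(tokens, vocabulary):
    """Fraction of running word tokens that are not in the vocabulary."""
    if not tokens:
        return 0.0
    n_oov = sum(1 for t in tokens if t not in vocabulary)
    return n_oov / len(tokens)


# Hypothetical toy classes; the real LM covers the whole Trentino province.
classes = {
    "CITY": ["trento", "rovereto", "arco", "riva"],
    "HOTEL": ["hotel_a", "hotel_b"],
}
probs = uniform_class_probs(classes)

# Hypothetical vocabulary: a few function words plus all class members.
vocab = {"go", "to", "book", "a", "room", "at"}
vocab |= {w for members in classes.values() for w in members}

# One toy test utterance; "and" is the only token missing from vocab.
test_tokens = "go to trento and book a room at hotel_a".split()
rate = oov_rate(test_tokens, vocab)

print(probs["CITY"]["trento"])
print(f"OOV rate: {rate:.1%}")
```

The OOV rate here is counted over running word tokens, which is one common convention; the text does not state whether the 1.1% figure is computed over tokens or over distinct word types.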
