USE OF MULTIPLE SPEECH RECOGNITION UNITS IN AN IN-CAR ASSISTANCE SYSTEM - DSP for In-Vehicle and Mobile Systems

framework was realized, based on the concept of several speech recognition

units that run in parallel and use class-based statistical language models or

grammars.

The objective of this chapter is to investigate a simple selection

method to choose the most likely output among those provided by a set of

recognition units fed with a common input signal [8]. A corpus of real

spontaneous speech utterances acquired in the car is employed to test the

accuracy of the resulting speech recognizer. The chapter is organized as

follows: section 2 introduces the general system architecture and presents

some details about the principal subsystems; section 3 describes the test

database collected with the Wizard-of-Oz (WOZ) method and some experiments with

multiple recognition units. In the final section, we draw some conclusions and

describe future developments.

2. SYSTEM ARCHITECTURE

The general architecture of the VICO system is shown in Figure 6-1,

where the blocks “Front-end processing”, “Recognition engine” and

“Recognizer output selector” constitute the subsystem used in the experiments

described later in this chapter.

The front-end processing is based on robust speech activity detection,

noise reduction and feature extraction. The recognition module is conceived

as a set of Speech Recognition Units (SRUs) working in parallel, each one

with its own specialized Language Model (LM), followed by an output

selection module. The aim of this configuration is to provide a more

reliable input to the Natural Language Understanding (NLU) module than

would be obtained with a single comprehensive LM and a correspondingly

very large vocabulary.
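
The arrangement described above (several units fed with the same input, followed by an output selector) can be sketched as follows. This is only an illustration: the class and function names are hypothetical, the decoders are stubs, and the rule of picking the highest score assumes that scores are comparable across units, which this section does not state.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Hypothesis:
    unit_name: str   # which SRU produced this output
    text: str        # recognized word sequence
    score: float     # assumed comparable across units; higher is better


@dataclass
class SpeechRecognitionUnit:
    name: str
    # Hypothetical decoder: maps feature frames to (text, score).
    decode: Callable[[Sequence[float]], Tuple[str, float]]

    def recognize(self, features: Sequence[float]) -> Hypothesis:
        text, score = self.decode(features)
        return Hypothesis(self.name, text, score)


def select_output(hypotheses: List[Hypothesis]) -> Optional[Hypothesis]:
    """Return the highest-scoring hypothesis, or None if there are none."""
    if not hypotheses:
        return None
    return max(hypotheses, key=lambda h: h.score)


# Stub decoders standing in for real engines with specialized LMs.
units = [
    SpeechRecognitionUnit("navigation", lambda f: ("go to the station", -1.2)),
    SpeechRecognitionUnit("hotels", lambda f: ("book the station", -3.4)),
]
features = [0.0, 0.1, 0.2]  # the same input is fed to every unit
best = select_output([u.recognize(features) for u in units])
print(best)
```

Keeping the selector separate from the units means a different selection rule can be swapped in without touching the recognizers.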

As shown in the figure, we assume that the Dialogue Manager (DM) can

dynamically load new LMs and activate or deactivate the single recognition

units at each dialogue step (i.e. recognition process) according to the context

of the dialogue interaction. If none of the outputs of the units is judged

reliable, the DM can load new LMs and ask for a further recognition step on

the given input utterance.
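
A minimal sketch of this reload-and-retry behaviour is given below. It assumes that reliability is judged by comparing a confidence value in [0, 1] against a threshold, which is an assumption made here for illustration; the function names, the LM names and the stub recognizer are all hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Result = Tuple[str, float]  # (recognized text, confidence in [0, 1])


def recognize_with_reload(
    utterance: Sequence[float],
    lm_sets: List[List[str]],
    recognize: Callable[[str, Sequence[float]], Result],
    threshold: float,
) -> Optional[Tuple[str, Result]]:
    """Try each LM set in turn on the same utterance.

    Returns (lm_name, result) for the most confident reliable output of the
    first LM set that yields one, or None if no set does.
    """
    for active_lms in lm_sets:
        # One recognition step: every active unit processes the utterance.
        outputs = [(lm, recognize(lm, utterance)) for lm in active_lms]
        reliable = [o for o in outputs if o[1][1] >= threshold]
        if reliable:
            return max(reliable, key=lambda o: o[1][1])
        # Otherwise fall through: load the next LM set and try again.
    return None


# Stub recognizer: fixed outputs per LM, standing in for real units.
FAKE_OUTPUTS = {
    "navigation": ("play to the jazz", 0.3),
    "hotels": ("stay some jazz", 0.2),
    "music": ("play some jazz", 0.9),
}


def fake_recognize(lm_name: str, utterance: Sequence[float]) -> Result:
    return FAKE_OUTPUTS[lm_name]


utterance = [0.0, 0.1, 0.2]
lm_sets = [["navigation", "hotels"], ["music"]]
chosen = recognize_with_reload(utterance, lm_sets, fake_recognize, 0.5)
print(chosen)
```

In this example the first LM set produces no reliable output, so the second set is loaded and the same utterance is recognized again.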

Note that the SRUs, once loaded, can be selected to run at the same time,

so that a user utterance is processed in parallel by all active SRUs. This

avoids the delay that would be introduced by an equivalent sequential

recognition approach.
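
The parallel processing of one utterance by all active units can be illustrated with a thread pool. This is a sketch under stated assumptions, not the VICO implementation: the unit names and stub decoders are hypothetical, and in Python a real speed-up over sequential execution is only obtained if the decoders release the interpreter lock (for example, native recognition engines).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Sequence, Tuple

Result = Tuple[str, float]  # (recognized text, score)
Decoder = Callable[[Sequence[float]], Result]


def run_units_in_parallel(
    active_units: Dict[str, Decoder], utterance: Sequence[float]
) -> Dict[str, Result]:
    """Submit the same utterance to every active unit and collect results."""
    if not active_units:
        return {}
    with ThreadPoolExecutor(max_workers=len(active_units)) as pool:
        futures = {
            name: pool.submit(decode, utterance)
            for name, decode in active_units.items()
        }
        # result() blocks until the corresponding unit has finished.
        return {name: fut.result() for name, fut in futures.items()}


# Stub decoders standing in for real recognition engines.
active_units = {
    "navigation": lambda u: ("go to the station", -1.2),
    "hotels": lambda u: ("book the station", -3.4),
}
results = run_units_in_parallel(active_units, [0.0, 0.1, 0.2])
print(results)
```

With truly concurrent decoders, the total latency is roughly that of the slowest unit rather than the sum over all units, which is the delay the sequential approach would incur.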
