Digital Signal Processing Reference
framework was realized, based on the concept of several speech recognition
units that run in parallel and use class-based statistical language models or
grammars.
The objective of this chapter is to investigate a simple selection method
for choosing the most likely output among those provided by a set of
recognition units fed with a common input signal [8]. A corpus of real
spontaneous speech utterances acquired in the car is used to test the
accuracy of the resulting speech recognizer. The chapter is organized as
follows: section 2 introduces the general system architecture and presents
some details about the principal subsystems; section 3 describes the test
database, collected with the Wizard-of-Oz (WOZ) method, and some experiments
with multiple recognition units. In the final section, we draw some
conclusions and describe future developments.
2. SYSTEM ARCHITECTURE
The general architecture of the VICO system is shown in Figure 6-1,
where the blocks “Front-end processing”, “Recognition engine” and
“Recognizer output selector” constitute the subsystem used in the experiments
described later in this chapter.
The front-end processing is based on robust speech activity detection,
noise reduction and feature extraction. The recognition module is conceived
as a set of Speech Recognition Units (SRUs) working in parallel, each one
with its own specialized Language Model (LM), followed by an output
selection module. The aim of this configuration is to provide the Natural
Language Understanding (NLU) module with a more reliable input than would
be obtained with a single comprehensive LM and its correspondingly very
large vocabulary.
As shown in the figure, we assume that the Dialogue Manager (DM) can
dynamically load new LMs and activate or deactivate individual recognition
units at each dialogue step (i.e., each recognition process) according to the
context of the dialogue interaction. If none of the unit outputs is judged
reliable, the DM can load new LMs and request a further recognition step on
the same input utterance.
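
The retry behaviour described above can be illustrated with a minimal sketch. It is not the VICO implementation: the names `Hypothesis`, `make_toy_unit` and `recognize_with_fallback`, the score in [0, 1], and the reliability threshold of 0.5 are all assumptions made for illustration, and the utterance is represented as a text string instead of an audio signal.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Hypothesis:
    unit: str     # name of the recognition unit that produced the output
    text: str     # recognized word string
    score: float  # hypothetical reliability score in [0, 1]


# A recognition unit is modelled as a function: utterance -> Hypothesis.
Recognizer = Callable[[str], Hypothesis]


def make_toy_unit(name: str, vocabulary: List[str]) -> Recognizer:
    """Toy stand-in for an SRU: keeps in-vocabulary words, scores coverage."""
    vocab = set(vocabulary)

    def recognize(utterance: str) -> Hypothesis:
        words = utterance.lower().split()
        known = [w for w in words if w in vocab]
        score = len(known) / len(words) if words else 0.0
        return Hypothesis(name, " ".join(known), score)

    return recognize


def recognize_with_fallback(
    utterance: str,
    active: Dict[str, Recognizer],
    reserve: Dict[str, Recognizer],
    threshold: float = 0.5,  # assumed reliability threshold
) -> Optional[Hypothesis]:
    """Try the active units first; if no output is judged reliable,
    load the reserve units and recognize the same utterance again."""
    for units in (active, reserve):
        hypotheses = [unit(utterance) for unit in units.values()]
        reliable = [h for h in hypotheses if h.score >= threshold]
        if reliable:
            return max(reliable, key=lambda h: h.score)
    return None  # nothing reliable even after the second step


if __name__ == "__main__":
    active = {"navigation": make_toy_unit("navigation", ["go", "to", "trento"])}
    reserve = {"hotels": make_toy_unit("hotels", ["book", "a", "hotel"])}
    print(recognize_with_fallback("book a hotel", active, reserve))
```

In this sketch the second pass reuses the stored utterance, mirroring the further recognition step that the DM can request on the given input.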
Note that the SRUs, once loaded, can be set to run at the same time, so
that a user utterance is processed in parallel by all active SRUs. This
avoids the delay that any equivalent sequential recognition approach would
introduce.
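
As a rough illustration of parallel units followed by an output selector, the sketch below dispatches one utterance to every active unit concurrently and then keeps the highest-scoring output. The function names, the toy units and the "pick the maximum score" rule are assumptions made for this example; the selection criterion actually used in the chapter is not specified in this passage.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional, Tuple

Output = Tuple[str, float]  # (recognized text, hypothetical score)
Unit = Callable[[str], Output]


def make_toy_unit(vocabulary: List[str]) -> Unit:
    """Toy stand-in for an SRU with its own specialized vocabulary."""
    vocab = set(vocabulary)

    def recognize(utterance: str) -> Output:
        words = utterance.lower().split()
        known = [w for w in words if w in vocab]
        score = len(known) / len(words) if words else 0.0
        return " ".join(known), score

    return recognize


def run_units_in_parallel(utterance: str, units: Dict[str, Unit]) -> Dict[str, Output]:
    """Feed the same utterance to every active unit concurrently."""
    if not units:
        return {}
    with ThreadPoolExecutor(max_workers=len(units)) as pool:
        futures = {name: pool.submit(unit, utterance) for name, unit in units.items()}
        return {name: future.result() for name, future in futures.items()}


def select_output(results: Dict[str, Output]) -> Optional[Tuple[str, str, float]]:
    """Assumed selector: return (unit name, text, score) with the top score."""
    if not results:
        return None
    best = max(results, key=lambda name: results[name][1])
    text, score = results[best]
    return best, text, score


if __name__ == "__main__":
    units = {
        "navigation": make_toy_unit(["go", "to", "trento"]),
        "hotels": make_toy_unit(["book", "a", "hotel", "in", "trento"]),
    }
    results = run_units_in_parallel("book a hotel in trento", units)
    print(select_output(results))
```

Threads are used here only to show the concurrent dispatch pattern; in CPython, CPU-bound decoding would need separate processes or native code to achieve real parallel speed-up.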