A Likelihood-Maximizing Framework for Enhanced In-Car Speech Recognition Based on Speech Dialog System Interaction - Digital Signal Processing for In-Vehicle Systems and Safety


where the dependence of the feature vectors $Z$ on the enhancement parameters $x$ is clearly shown. The acoustic score $P(Z(x) \mid w)$ is the measure of importance in LIMA systems, as the transcription on which the optimization takes place is assumed to be known, and therefore the language model score $P(w)$ will not change. The aim of likelihood maximization for MFNS is therefore to optimize the parameters to maximize the acoustic score of the recognized word sequence $w$.

An initial decode pass is performed using default enhancement parameters to generate a state sequence $s$ on which to optimize. In order to find the optimal values of $x$, gradient-based optimization is used on the total log-likelihood of the observed features, which is defined by

$$L(x) = \sum_i \log\big(P(z_i(x) \mid s_i)\big) \tag{10.2}$$
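As a rough illustration of Eq. 10.2, the sketch below sums the per-frame log-likelihoods of already-enhanced feature vectors given a fixed state sequence. It assumes diagonal-covariance Gaussian mixtures, and the data layout (a dictionary of states, each a list of `(weight, mean, variance)` tuples) and all function names are hypothetical choices made for this example, not taken from the chapter.

```python
import math

def log_gaussian_diag(z, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector z."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (a - mu) ** 2 / v)
        for a, mu, v in zip(z, mean, var)
    )

def log_state_likelihood(z, state):
    """log P(z | s) for a GMM state given as a list of (weight, mean, var)."""
    terms = [math.log(w) + log_gaussian_diag(z, mu, var) for w, mu, var in state]
    top = max(terms)  # log-sum-exp keeps the sum numerically stable
    return top + math.log(sum(math.exp(t - top) for t in terms))

def total_log_likelihood(features, state_seq, states):
    """Eq. 10.2: sum over frames i of log P(z_i | s_i)."""
    return sum(
        log_state_likelihood(z, states[s]) for z, s in zip(features, state_seq)
    )
```

Here `features` stands for the enhanced vectors $z_i(x)$ at the current parameter values, so the function would be re-evaluated each time the parameters change.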

For a Hidden Markov Model (HMM) speech recognizer using Gaussian mixture state models (as used in this chapter), the gradient of the total log-likelihood is given by [2]

$$\nabla_x L(x) = -\sum_i \sum_{m=1}^{M} \gamma_{im}(x)\, \frac{\partial z_i(x)}{\partial x}\, \Sigma_{im}^{-1} \big(z_i(x) - \mu_{im}\big), \tag{10.3}$$

where $\gamma_{im}(x)$ is the a posteriori probability of the $m$th mixture component in state $s_i$ given the observed feature vector $z_i(x)$. The mean vector $\mu_{im}$ and covariance matrix $\Sigma_{im}$ from the acoustic model are required for each state $i$ and mixture component $m$ in order to calculate the gradient. The remaining term in Eq. 10.3 is the Jacobian matrix, $\partial z_i(x)/\partial x$, which consists of the partial derivatives of each element of the feature vector with respect to each of the enhancement parameters. Each Jacobian element is derived directly from the enhancement procedure (refer to Sect. 10.2.3).

Once the gradient-based optimization converges, the new enhancement parameters are used to generate another set of feature vectors, and a subsequent decode pass is performed. A new state sequence is generated, and the enhancement parameters are further optimized for this new state sequence. The process continues until the recognition likelihood (and state sequence) converges, ensuring joint optimization of the recognized state sequence and the speech enhancement parameters.
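The alternation between decoding and parameter optimization can be sketched as follows. This is a toy illustration under several assumptions that are not from the chapter: a single scalar parameter $x$; a hypothetical enhancement $z_i(x) = y_i - x\,n$ that subtracts a scaled noise estimate (so the Jacobian is simply $-n$); diagonal covariances; plain fixed-step gradient ascent standing in for whichever gradient-based optimizer is actually used; and a per-frame maximum-likelihood state choice standing in for a real decode pass. All names, step sizes, and iteration counts are illustrative.

```python
import math

def posteriors(z, state):
    """Return (gamma_im for each component, log P(z | s)) for a diagonal GMM state."""
    logs = [
        math.log(w) + sum(-0.5 * (math.log(2 * math.pi * v) + (a - m) ** 2 / v)
                          for a, m, v in zip(z, mu, var))
        for w, mu, var in state
    ]
    top = max(logs)
    exps = [math.exp(l - top) for l in logs]
    total = sum(exps)
    return [e / total for e in exps], top + math.log(total)

def enhance(y, noise, x):
    """Hypothetical enhancement: subtract a scaled noise estimate from each frame."""
    return [[a - x * n for a, n in zip(frame, noise)] for frame in y]

def loglik_and_grad(y, noise, x, state_seq, states):
    """Total log-likelihood (Eq. 10.2) and its gradient (Eq. 10.3) for scalar x."""
    total, grad = 0.0, 0.0
    for z, s in zip(enhance(y, noise, x), state_seq):
        gammas, ll = posteriors(z, states[s])
        total += ll
        for g, (_, mu, var) in zip(gammas, states[s]):
            # -gamma * (dz/dx)^T Sigma^{-1} (z - mu), with dz/dx = -noise here
            grad -= g * sum(-n * (a - m) / v
                            for n, a, m, v in zip(noise, z, mu, var))
    return total, grad

def decode(y, noise, x, states):
    """Toy stand-in for a decode pass: most likely state for each frame."""
    return [max(states, key=lambda s: posteriors(z, states[s])[1])
            for z in enhance(y, noise, x)]

def lima(y, noise, states, x=0.0, step=0.05, max_passes=20, inner_iters=200):
    """Alternate decoding and gradient ascent until the state sequence is stable."""
    state_seq = decode(y, noise, x, states)  # initial pass, default parameter
    for _ in range(max_passes):
        for _ in range(inner_iters):  # optimize x for the fixed state sequence
            _, g = loglik_and_grad(y, noise, x, state_seq, states)
            x += step * g
        new_seq = decode(y, noise, x, states)
        if new_seq == state_seq:  # state sequence has converged
            break
        state_seq = new_seq
    return x, state_seq
```

A real system would replace `decode` with a full recognizer pass over the re-enhanced features and `enhance` with the actual enhancement front end, whose Jacobian then takes the place of the constant `-noise` used here.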

10.2.2 Optimization Methods for In-Car ASR

10.2.2.1 Calibrated LIMA Framework

The simplest and most common approach for optimizing the enhancement parameters is to use a calibration session with a known transcription $w_C$. Previous studies used a single known utterance for each speaker in order to determine
