Digital Signal Processing Reference
where the dependence of the feature vectors $Z$ on the enhancement parameters
$x$ is clearly shown. The acoustic score $P(Z(x) \mid w)$ is the measure of
importance in LIMA systems: the transcription on which the optimization takes
place is assumed to be known, and therefore the language model score $P(w)$
will not change. The aim of likelihood maximization for MFNS is therefore to
optimize the parameters to maximize the acoustic score of the recognized word
sequence $w$.
An initial decode pass is performed using default enhancement parameters to
generate a state sequence $s$ on which to optimize. In order to find the optimal
values of $x$, gradient-based optimization is used on the total log-likelihood of
the observed features, which is defined by

$$L(x) = \sum_i \log\left(P(z_i(x) \mid s_i)\right). \tag{10.2}$$
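As an illustration of Eq. 10.2, the following minimal sketch evaluates the total log-likelihood for a fixed state sequence. It assumes diagonal-covariance Gaussian mixture states and that the features $z_i(x)$ have already been computed for the current parameters; the function names and the data layout are hypothetical and are not taken from the chapter.

```python
import math

def log_gaussian_diag(z, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector z."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (zk - mk) ** 2 / v)
        for zk, mk, v in zip(z, mean, var)
    )

def log_state_likelihood(z, state):
    """log P(z | s) for a GMM state given as a list of (weight, mean, var)."""
    terms = [math.log(w) + log_gaussian_diag(z, mu, var) for w, mu, var in state]
    peak = max(terms)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def total_log_likelihood(features, state_sequence, states):
    """L(x) = sum_i log P(z_i(x) | s_i), with features already computed for x."""
    return sum(
        log_state_likelihood(z, states[s])
        for z, s in zip(features, state_sequence)
    )

# Toy example (assumed data): two 1-D states, three frames aligned to 0, 0, 1.
states = {
    0: [(1.0, [0.0], [1.0])],
    1: [(0.5, [-1.0], [1.0]), (0.5, [1.0], [1.0])],
}
features = [[0.0], [0.0], [0.0]]
print(total_log_likelihood(features, [0, 0, 1], states))
```

Working in the log domain with a log-sum-exp over mixture components avoids underflow when individual component likelihoods are very small.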
For a Hidden Markov Model (HMM) speech recognizer using Gaussian mixture
state models (as used in this chapter), the gradient of the total log-likelihood is
given by [2]

$$\nabla_x L(x) = -\sum_i \sum_{m=1}^{M} \gamma_{im}(x)\, \frac{\partial z_i(x)}{\partial x}\, \Sigma_{im}^{-1} \left(z_i(x) - \mu_{im}\right), \tag{10.3}$$
where $\gamma_{im}(x)$ is the a posteriori probability of the $m$th mixture
component in state $s_i$ given the observed feature vector $z_i(x)$. The mean
vector $\mu_{im}$ and covariance matrix $\Sigma_{im}$ from the acoustic model are
required for each state $i$ and mixture component $m$ in order to calculate the
gradient. The remaining term in Eq. 10.3 is the Jacobian matrix,
$\partial z_i(x)/\partial x$, which consists of the partial derivatives of each
element of the feature vector with respect to each of the enhancement parameters.
Each Jacobian element is derived directly from the enhancement procedure (refer to
Sect. 10.2.3).
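The gradient in Eq. 10.3 can be sketched as follows. This is only an illustration under stated assumptions: covariances are taken to be diagonal, so that multiplying by the inverse covariance reduces to an element-wise division, and the Jacobians are supplied by the caller in a hypothetical layout where `jacobians[i][p][k]` holds the derivative of feature element `k` of frame `i` with respect to parameter `p`. None of the names come from the chapter.

```python
import math

def posteriors(z, state):
    """Posterior of each mixture component given z (diagonal covariances)."""
    logs = []
    for w, mu, var in state:
        ll = sum(
            -0.5 * (math.log(2.0 * math.pi * v) + (zk - mk) ** 2 / v)
            for zk, mk, v in zip(z, mu, var)
        )
        logs.append(math.log(w) + ll)
    peak = max(logs)
    unnorm = [math.exp(l - peak) for l in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def gradient(features, jacobians, state_sequence, states, n_params):
    """Gradient of the total log-likelihood with respect to the parameters.

    states[s] is a list of (weight, mean, var) tuples (assumed layout).
    """
    grad = [0.0] * n_params
    for z, jac, s in zip(features, jacobians, state_sequence):
        gammas = posteriors(z, states[s])
        for gamma, (_, mu, var) in zip(gammas, states[s]):
            # Inverse covariance times (z - mu), for a diagonal covariance.
            scaled = [(zk - mk) / v for zk, mk, v in zip(z, mu, var)]
            for p in range(n_params):
                grad[p] -= gamma * sum(j * r for j, r in zip(jac[p], scaled))
    return grad

# Toy example (assumed data): one parameter that shifts every 1-D feature,
# so each Jacobian is the 1x1 matrix [[1.0]].
states = {0: [(1.0, [0.0], [1.0])]}
features = [[1.0], [2.0]]
jacobians = [[[1.0]], [[1.0]]]
print(gradient(features, jacobians, [0, 0], states, 1))
```

In the toy case the gradient is negative because both features lie above the state mean, so decreasing the shift parameter increases the likelihood.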
Once the gradient-based optimization converges, the new enhancement parameters
are used to generate another set of feature vectors, and a subsequent decode pass is
performed. A new state sequence is generated, and the enhancement parameters are
further optimized for this new state sequence. The process continues until the
recognition likelihood (and state sequence) converges, ensuring joint optimization
of the recognized state sequence and the speech enhancement parameters.
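The alternation between decoding and parameter optimization described above can be sketched as a simple loop. The callables `extract_features`, `decode`, and `optimize_parameters` are hypothetical placeholders for the enhancement front end, the recognizer, and the gradient-based optimizer; the toy versions below (a single shift parameter, two single-Gaussian states, fixed-step gradient ascent) are assumptions made only so that the sketch runs.

```python
def lima_optimize(x0, extract_features, decode, optimize_parameters,
                  max_passes=10, tol=1e-6):
    """Alternate decode passes and parameter optimization until convergence."""
    x = x0
    states, score = decode(extract_features(x))  # initial decode pass
    for _ in range(max_passes):
        x_new = optimize_parameters(x, states)   # optimize for fixed states
        new_states, new_score = decode(extract_features(x_new))
        converged = new_states == states and abs(new_score - score) < tol
        x, states, score = x_new, new_states, new_score
        if converged:
            break
    return x, states, score

# Toy setup (assumed): features are raw values shifted by a scalar parameter x.
raw = [1.2, 2.9, -1.1]
means = {0: -2.0, 1: 2.0}  # one unit-variance Gaussian per state

def extract_features(x):
    return [r + x for r in raw]

def decode(features):
    # Pick the best-scoring state per frame; score omits constant terms.
    states = [min(means, key=lambda s: (z - means[s]) ** 2) for z in features]
    score = sum(-0.5 * (z - means[s]) ** 2 for z, s in zip(features, states))
    return states, score

def optimize_parameters(x, states, step=0.1, iters=200):
    # Fixed-step gradient ascent on the log-likelihood for a fixed alignment.
    for _ in range(iters):
        grad = -sum((r + x) - means[s] for r, s in zip(raw, states))
        x += step * grad
    return x

x_opt, final_states, final_score = lima_optimize(
    0.0, extract_features, decode, optimize_parameters
)
print(x_opt, final_states, final_score)
```

The loop always returns parameters together with the state sequence and score decoded from those same parameters, so the result stays consistent even if `max_passes` is reached before convergence.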
10.2.2 Optimization Methods for In-Car ASR
10.2.2.1 Calibrated LIMA Framework
The simplest and most common approach for optimizing the enhancement
parameters is to use a calibration session with a known transcription $w_C$. Previous
studies used a single known utterance for each speaker in order to determine