optimal enhancement parameters for that particular speaker [2, 3]. Whilst this
procedure ensures that optimization takes place on a state sequence which is
correct, calibrated LIMA frameworks inherently assume that background noise
conditions do not change between the calibration and testing sessions. This is a
major challenge for in-car speech recognition since vehicular environments are
subjected to continually changing noise levels and conditions, meaning calibration
utterances would be required every time noise conditions changed significantly
from the previous optimization. To overcome this, optimized enhancement
parameters could be stored for each common noise condition; however, this still
requires a calibration utterance to be used at some point in the system. Since there is
a wide range of noise conditions, the user would be continually asked to repeat the
adaptation utterance in order to obtain the optimal set of parameters. This operation
is an unnecessary annoyance and is likely to leave drivers frustrated with the
speech dialog system; such frustration could have further repercussions on ASR
and driving performance [7].
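The parameter-storage idea above can be sketched as a simple cache keyed by a coarse noise-condition estimate. This is only an illustrative assumption: the class, the quantized-SNR key, and the parameter names below are invented for the sketch, and a real system would estimate the noise condition from the audio and run its own optimization routine.

```python
class EnhancementParameterCache:
    """Stores optimized enhancement parameters keyed by a coarse
    noise-condition label (here, SNR quantized in steps of dB)."""

    def __init__(self, snr_step_db=5.0):
        self.snr_step_db = snr_step_db
        self._cache = {}

    def _key(self, snr_db):
        # Quantize the SNR so that similar noise conditions share one entry.
        return round(snr_db / self.snr_step_db) * self.snr_step_db

    def lookup(self, snr_db):
        """Return cached parameters, or None if this condition has not
        been seen and a calibration utterance would still be required."""
        return self._cache.get(self._key(snr_db))

    def store(self, snr_db, params):
        self._cache[self._key(snr_db)] = params


cache = EnhancementParameterCache()
cache.store(12.3, {"oversubtraction": 2.0, "floor": 0.05})
print(cache.lookup(10.1))  # nearby SNR maps to the same cached entry
print(cache.lookup(-5.0))  # unseen condition: calibration still needed
```

The quantization step controls the trade-off the text describes: coarser keys mean fewer calibration utterances but parameters that fit each condition less precisely.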
An alternative solution is to calibrate once only for each driving session (e.g., a
common startup utterance such as “Start dialog” could be used for adaptation), but
this introduces the risk of inferior recognition in noise conditions significantly
different to those present during calibration.
The calibration framework is also reliant on the words contained in the adaptation
utterance; therefore, it is necessary for the adaptation utterance to be phonetically
balanced and sufficiently long to provide as much acoustic model coverage
as possible in order to generalize the optimized enhancement parameters. This is in
direct conflict with the majority of dialog systems which promote simpler linguistic
structures than human conversation and are therefore unlikely to be phonetically
balanced. Thus, a separate utterance unrelated to the dialog transaction is required
which is likely to be seen by the user as a further inconvenience and therefore
impractical for this particular application.
10.2.2.2 Unsupervised LIMA Framework
The unsupervised LIMA framework proposed in [2] may be a more appropriate
choice for in-car environments. Unsupervised adaptation removes the restriction of
a calibration utterance (thereby making the adaptation process transparent to the
user), and instead, optimization takes place on an utterance-by-utterance basis. The
major issue with unsupervised operation is that it uses a hypothesized transcription,
ŵ, rather than the true transcription wC. The hypothesized transcription is highly
reliant on the effectiveness of the underlying acoustic models and the state sequence
generated by Viterbi alignment; therefore, it is likely to be less than 100% correct.
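The utterance-by-utterance operation can be illustrated with a minimal skeleton. The recognizer and optimizer below are stand-in stubs invented for this sketch (not the actual framework); the point is only to show where the hypothesized transcription ŵ takes the place of a calibration utterance.

```python
def recognize(utterance, params):
    """Stand-in for the recognizer: returns a hypothesized transcription
    w_hat (which may contain errors) and its Viterbi state alignment.
    Here we simply echo the reference words to keep the sketch runnable."""
    w_hat = utterance["words"]
    alignment = list(range(len(w_hat)))
    return w_hat, alignment


def optimize_enhancement(utterance, w_hat, alignment, params):
    """Stand-in for the likelihood-based optimization, which in the real
    framework runs against the hypothesized (not true) state sequence.
    A dummy scaling update keeps the example self-contained."""
    return {k: v * 0.9 for k, v in params.items()}


params = {"oversubtraction": 2.0}
for utterance in [{"words": ["start", "dialog"]}]:
    # No calibration utterance: each incoming utterance is first
    # recognized, then reused to re-optimize the enhancement parameters.
    w_hat, alignment = recognize(utterance, params)
    params = optimize_enhancement(utterance, w_hat, alignment, params)
print(params)
```

Because the loop trusts its own hypothesis ŵ, any recognition error feeds directly into the optimization step, which is exactly the weakness discussed next.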
Since the true transcription wC is unknown, it is possible that states in
the hypothesized transcription ŵ are incorrect due to misrecognition and frame
alignment errors (N.B. frame alignment errors will occur even when the transcrip-
tion is known a priori, but should be limited). These inaccurate states will lead to the