Digital Signal Processing Reference
the phonemes (so phonemes with different left and right context have different
realizations as HMM states); it would use cepstral normalization to normalize for
different speaker and recording conditions; for further speaker normalization it might use
vocal tract length normalization (VTLN) for male-female normalization and maximum
likelihood linear regression (MLLR) for more general speaker adaptation. The features
would have so-called delta and delta-delta coefficients to capture speech dynamics and,
in addition, might use heteroscedastic linear discriminant analysis (HLDA); alternatively,
the system might skip the delta and delta-delta coefficients and use splicing and an
LDA-based projection, perhaps followed by heteroscedastic linear discriminant analysis or
a global semi-tied covariance transform (also known as a maximum likelihood linear
transform, or MLLT).
Many systems use so-called discriminative training techniques, which dispense with a
purely statistical approach to HMM parameter estimation and instead optimize some
classification-related measure of the training data. Examples are maximum mutual
information (MMI), minimum classification error (MCE), and minimum phone error
(MPE).
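The feature steps named above (cepstral normalization followed by delta and delta-delta
coefficients) can be sketched as follows. This is a minimal illustration, not the front
end of any particular recognizer: the function names, the regression window of 2 frames,
and the choice to pad by repeating edge frames are all assumptions made for the example.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-utterance mean from each cepstral coefficient."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]


def delta(frames, window=2):
    """Regression-based time derivative; edge frames are repeated as padding.

    The window size of 2 is an assumption for this sketch.
    """
    n = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(dim):
            acc = 0.0
            for k in range(1, window + 1):
                ahead = frames[min(t + k, n - 1)][d]
                behind = frames[max(t - k, 0)][d]
                acc += k * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


def add_dynamics(frames, window=2):
    """Return normalized static + delta + delta-delta features per frame."""
    static = cepstral_mean_normalize(frames)
    d1 = delta(static, window)
    d2 = delta(d1, window)
    return [s + a + b for s, a, b in zip(static, d1, d2)]
```

Each output frame is three times as long as the input frame: the normalized static
coefficients, then their first-order and second-order time derivatives.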
Decoding of the speech (the term for what happens when the system is presented with a
new utterance and must compute the most likely source sentence) would probably use the
Viterbi algorithm to find the best path, and here there is a choice between dynamically
creating a combination hidden Markov model that includes both the acoustic and
language model information, or combining them statically beforehand (the finite state
transducer, or FST, approach).
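The best-path search can be illustrated with a small Viterbi sketch over a discrete HMM.
Everything in the example data is hypothetical: the two states ("sil", "speech"), the
three observation symbols, and all probabilities are invented for illustration, and a
real decoder would work with continuous acoustic scores and a far larger combined
acoustic and language model graph.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely state path and its log-probability."""
    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log_emit[s][observations[t]]
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


def to_log(table):
    """Convert a nested dict of probabilities to log-probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()} for k, row in table.items()}


# Hypothetical toy model, for illustration only.
states = ["sil", "speech"]
log_start = {"sil": math.log(0.6), "speech": math.log(0.4)}
log_trans = to_log({"sil": {"sil": 0.7, "speech": 0.3},
                    "speech": {"sil": 0.4, "speech": 0.6}})
log_emit = to_log({"sil": {"low": 0.5, "mid": 0.4, "high": 0.1},
                   "speech": {"low": 0.1, "mid": 0.3, "high": 0.6}})

path, score = viterbi(["low", "mid", "high"], states, log_start, log_trans, log_emit)
# path is ["sil", "sil", "speech"]
```

Working in log-probabilities turns the products along a path into sums, which avoids
numerical underflow on long utterances.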
Dynamic time warping (DTW)-based speech recognition
Dynamic time warping is an approach that was historically used for speech recognition
but has now largely been displaced by the more successful HMM-based approach.
Dynamic time warping is an algorithm for measuring similarity between two sequences
which may vary in time or speed. For instance, similarities in walking patterns would be
detected even if the person was walking slowly in one video and more quickly in another,
or even if there were accelerations and decelerations during the course of one
observation. DTW has been applied to video, audio, and graphics; indeed, any data that
can be turned into a linear representation can be analyzed with DTW.
A well-known application has been automatic speech recognition, to cope with different
speaking speeds. In general, DTW is a method that allows a computer to find an optimal
match between two given sequences (e.g. time series) under certain restrictions: the
sequences are "warped" non-linearly to match each other. This sequence alignment
method is often used in the context of hidden Markov models.
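A minimal sketch of the classic dynamic-programming formulation of DTW is given below.
It assumes one-dimensional sequences and an absolute-difference local cost (both choices
made for illustration; speech systems would compare feature vectors), and the
restrictions it enforces are the usual ones: the alignment starts at the first elements,
ends at the last elements, and never moves backwards in either sequence.

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Return the minimum cumulative cost of aligning sequence a with sequence b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j]: best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances while b holds (b is stretched)
                cost[i][j - 1],      # b advances while a holds (a is stretched)
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


# The same shape at half the speed aligns with zero cost.
slow = [1, 1, 2, 2, 3, 3]
fast = [1, 2, 3]
distance = dtw_distance(fast, slow)
```

The table has (n + 1) by (m + 1) entries, so time and memory grow with the product of
the two sequence lengths.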
Further information