Speech Recognition - Important Concepts in Signal Processing, Image Processing and Data Compression

Digital Signal Processing Reference


the phonemes (so phonemes with different left and right context have different

realizations as HMM states); it would use cepstral normalization to normalize for

different speaker and recording conditions; for further speaker normalization it might use

vocal tract length normalization (VTLN) for male-female normalization and maximum

likelihood linear regression (MLLR) for more general speaker adaptation. The features

would have so-called delta and delta-delta coefficients to capture speech dynamics and in

addition might use heteroscedastic linear discriminant analysis (HLDA); or might skip

the delta and delta-delta coefficients and use splicing and an LDA-based projection

followed perhaps by HLDA or a global semi-tied

covariance transform (also known as maximum likelihood linear transform, or MLLT).
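Two of the front-end steps named above, cepstral normalization and the delta and delta-delta coefficients, are simple enough to sketch directly. The following is a minimal illustration, not the front end of any particular recognizer. It assumes per-utterance mean subtraction as the form of cepstral normalization, the common regression formula for deltas with a window of two frames, and repetition of the first and last frame at the utterance edges. All function names are hypothetical.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-utterance mean from every cepstral dimension.

    `frames` is a list of equal-length lists of floats (one list per frame).
    """
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]


def deltas(frames, window=2):
    """Regression-based delta coefficients.

    delta[t] = sum_k k * (c[t+k] - c[t-k]) / (2 * sum_k k^2), k = 1..window.
    Assumption: frames beyond either edge are replaced by the edge frame.
    """
    n = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(dim):
            num = 0.0
            for k in range(1, window + 1):
                nxt = frames[min(t + k, n - 1)][d]
                prv = frames[max(t - k, 0)][d]
                num += k * (nxt - prv)
            row.append(num / denom)
        out.append(row)
    return out


def add_dynamic_features(frames, window=2):
    """Append delta and delta-delta coefficients to each static frame."""
    d1 = deltas(frames, window)
    d2 = deltas(d1, window)  # delta-delta: deltas of the deltas
    return [f + a + b for f, a, b in zip(frames, d1, d2)]
```

With 13 static cepstral coefficients per frame, add_dynamic_features would yield 39-dimensional feature vectors, which is the layout the text describes before any HLDA or LDA-based projection is applied.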

Many systems use so-called discriminative training techniques which dispense with a

purely statistical approach to HMM parameter estimation and instead optimize some

classification-related measure of the training data. Examples are maximum mutual

information (MMI), minimum classification error (MCE) and minimum phone error

(MPE).
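To make the contrast with maximum likelihood estimation concrete, the two objectives can be written side by side. The notation here is an assumption introduced for illustration: O_r is the observation sequence of training utterance r, w_r its reference transcription, M_w the HMM corresponding to word sequence w, P(w) the language model probability, and lambda the HMM parameters.

```latex
% Maximum likelihood: raise the likelihood of each utterance under its reference model.
\mathcal{F}_{\mathrm{ML}}(\lambda) = \sum_{r=1}^{R} \log p_\lambda(O_r \mid M_{w_r})

% MMI: raise the posterior of the reference transcription against all competitors.
\mathcal{F}_{\mathrm{MMI}}(\lambda) = \sum_{r=1}^{R} \log
  \frac{p_\lambda(O_r \mid M_{w_r})\, P(w_r)}
       {\sum_{w} p_\lambda(O_r \mid M_{w})\, P(w)}
```

The denominator sums over competing word sequences, so increasing the MMI objective requires lowering the score of incorrect hypotheses as well as raising the score of the correct one.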

Decoding of the speech (the term for what happens when the system is presented with a

new utterance and must compute the most likely source sentence) would probably use the

Viterbi algorithm to find the best path. Here there is a choice between dynamically

creating a combined hidden Markov model that includes both the acoustic and

language model information, or combining the two statically beforehand (the finite-state

transducer, or FST, approach).
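The best-path search itself can be sketched for a small discrete HMM. This is a minimal sketch under stated assumptions, not a speech decoder: a real system searches a much larger combined acoustic and language model and prunes unlikely paths. The function name, the dictionary-based model layout, and the two-state toy model below are all illustrative.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely state path and its log probability.

    All model parameters are log probabilities stored in nested dicts.
    """
    # best[t][s]: best log score of any path that ends in state s at time t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the predecessor that maximizes the score of reaching s.
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log_emit[s][observations[t]]
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


def to_log(table):
    """Convert a (possibly nested) dict of probabilities to log probabilities."""
    return {
        k: to_log(v) if isinstance(v, dict) else math.log(v)
        for k, v in table.items()
    }


# Hypothetical toy model: two hidden states, three observable symbols.
STATES = ["A", "B"]
START = to_log({"A": 0.6, "B": 0.4})
TRANS = to_log({"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}})
EMIT = to_log({
    "A": {"x": 0.5, "y": 0.4, "z": 0.1},
    "B": {"x": 0.1, "y": 0.3, "z": 0.6},
})

best_path, best_log_prob = viterbi(["x", "y", "z"], STATES, START, TRANS, EMIT)
```

Working in log probabilities avoids numerical underflow on long utterances, because the products of many small probabilities become sums.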

Dynamic time warping (DTW)-based speech recognition

Dynamic time warping is an approach that was historically used for speech recognition

but has now largely been displaced by the more successful HMM-based approach.

Dynamic time warping is an algorithm for measuring similarity between two sequences

which may vary in time or speed. For instance, similarities in walking patterns would be

detected even if the person walked slowly in one video and more quickly in another,

or accelerated and decelerated during the course of one observation. DTW has been

applied to video, audio, and graphics; indeed, any data which can be turned into

a linear representation can be analyzed with DTW.

A well-known application has been automatic speech recognition, to cope with different

speaking speeds. In general, it is a method that allows a computer to find an optimal

match between two given sequences (e.g. time series) with certain restrictions, i.e. the

sequences are "warped" non-linearly to match each other. This sequence alignment

method is often used in the context of hidden Markov models.
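The optimal match described above is usually computed by dynamic programming. The following is a minimal sketch for one-dimensional sequences, assuming absolute difference as the local distance and an optional band around the diagonal as the restriction on warping. The function name and the window parameter are illustrative, not taken from any particular library.

```python
def dtw_distance(a, b, window=None):
    """Dynamic time warping distance between two numeric sequences.

    `window` limits how far the alignment may stray from the diagonal;
    None means no limit. Assumption: local cost is |a[i] - b[j]|.
    """
    n, m = len(a), len(b)
    if window is None:
        window = max(n, m)
    # The band must be at least wide enough to reach the final cell.
    window = max(window, abs(n - m))
    inf = float("inf")
    # cost[i][j]: best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

A sequence and a slowed-down copy of it, such as [1, 2, 3] and [1, 1, 2, 2, 3, 3], have a DTW distance of zero, which is how the method copes with different speaking speeds. For speech, each element would be a feature vector and the local distance a vector distance, rather than the scalars used here.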

