NOISE ROBUST SPEECH RECOGNITION USING PROSODIC INFORMATION - DSP for In-Vehicle and Mobile Systems


(wild card) means that HMMs are built without considering the transitions, “U”, “D” or “F”. The total number of S-HMM states is the same as the number of SP-HMM states. Twenty S-HMMs, including “sil” and “sp”, are trained.

2. Training utterances are segmented into syllables by forced alignment using the S-HMMs; then one of the transition labels, “U”, “D” or “F”, is manually assigned to each segment according to its actual pattern.

3. “P-HMMs (Prosodic HMMs)”, each having a single state, are trained on the prosodic features within these segments, according to the transition label. Eight separate models, including “sil” and “sp”, are made. Each P-HMM has a single state, since it has been found that syllabic contours in Japanese can be approximated by a linear function [4] and that the value can be expected to be almost constant in each CV syllable.

4. The S-HMMs and P-HMMs are combined to make SP-HMMs. Gaussian mixtures for the segmental feature stream of SP-HMMs are tied with corresponding S-HMM mixtures, while the mixtures for the prosodic feature stream are tied with corresponding P-HMM mixtures. Figure 9-3
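The tying described in step 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names `DiagGaussianMixture` and `SPState`, the state names in the example, the toy parameter values, and the stream weights `w_seg` and `w_pro` are all hypothetical. The weighted sum of per-stream log-likelihoods is the usual multi-stream HMM formulation and is assumed here rather than taken from the text. Tying is modelled simply by letting several SP-HMM states hold a reference to the same mixture object.

```python
import math
from dataclasses import dataclass


@dataclass
class DiagGaussianMixture:
    """Diagonal-covariance Gaussian mixture for one feature stream."""
    weights: list    # mixture weights, expected to sum to 1
    means: list      # one mean vector per component
    variances: list  # one variance vector per component

    def log_likelihood(self, x):
        # Log-density of each component, then a stable log-sum-exp.
        comp_logs = []
        for w, mu, var in zip(self.weights, self.means, self.variances):
            ll = math.log(w)
            for xi, m, v in zip(x, mu, var):
                ll += -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
            comp_logs.append(ll)
        top = max(comp_logs)
        return top + math.log(sum(math.exp(c - top) for c in comp_logs))


@dataclass
class SPState:
    """One SP-HMM state whose two streams point at tied mixtures."""
    segmental: DiagGaussianMixture  # shared with an S-HMM state
    prosodic: DiagGaussianMixture   # shared with a single-state P-HMM
    w_seg: float = 1.0              # assumed stream weight (hypothetical)
    w_pro: float = 1.0              # assumed stream weight (hypothetical)

    def log_output_prob(self, o_seg, o_pro):
        # Weighted sum of the per-stream log-likelihoods.
        return (self.w_seg * self.segmental.log_likelihood(o_seg)
                + self.w_pro * self.prosodic.log_likelihood(o_pro))


# Toy parameters, for illustration only.
s_mix = DiagGaussianMixture([1.0], [[0.0, 0.0]], [[1.0, 1.0]])  # S-HMM state
p_up = DiagGaussianMixture([1.0], [[0.5]], [[0.1]])             # P-HMM "U"
p_down = DiagGaussianMixture([1.0], [[-0.5]], [[0.1]])          # P-HMM "D"

# Two SP-HMM states that differ only in the transition label share s_mix.
state_u = SPState(segmental=s_mix, prosodic=p_up)
state_d = SPState(segmental=s_mix, prosodic=p_down)

rising = state_u.log_output_prob([0.1, -0.2], [0.4])
falling = state_d.log_output_prob([0.1, -0.2], [0.4])
```

Because both states reference the same `s_mix` object, re-estimating the segmental mixture once updates every SP-HMM state tied to it, which is the practical effect of tying. In this toy setup a rising prosodic observation scores higher under `state_u` than under `state_d`, although the segmental stream contributes identically to both.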
