NOISE ROBUST SPEECH RECOGNITION USING PROSODIC INFORMATION - DSP for In-Vehicle and Mobile Systems - page 145


3.3 Multi-stream Syllable HMMs

3.3.1 Basic Structure of Syllable HMMs. Since CV syllable transition and the change of characteristics such as “rising”, “falling” and “flat” are highly related, the segmental and prosodic features are integrated using syllable-unit HMMs. Our preliminary experiments showed that syllable-unit HMMs achieve approximately the same digit recognition accuracy on a connected-digit task as tied-state triphone HMMs.

The integrated syllable HMM, denoted “SP-HMM (Segmental-Prosodic HMM)”, models both phonetic context and transition. Table 9.1 lists the SP-HMMs used in our experiments. Each Japanese digit uttered continuously with other digits can be modeled by a concatenation of two context-dependent syllables. Even “2” (/ni/) and “5” (/go/) can be modeled by two syllables, since their final vowel is often lengthened as /ni:/ and /go:/. In our experiment the context of each syllable is considered only within each digit. Therefore, each SP-HMM is denoted by either a left-context dependent syllable “LC-SYL,PM” or a right-context dependent syllable “SYL+RC,PM”, where “PM” indicates a transition pattern that is either rising (“U”), falling (“D”) or flat (“F”). For example, the first syllable /i/ of “1” (/ichi/) with a rising transition is denoted as “i+chi,U”. Each SP-HMM has a standard left-to-right topology whose number of states depends on the number of phonemes in the syllable. “sil” and “sp” models are used to represent a silence between digit strings and a short pause between digits, respectively.
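The naming convention above can be illustrated with a small label parser. This is only a sketch of the notation described in the text: the helper name `parse_sp_label`, the `SPLabel` record and the left-context example label are hypothetical and do not come from the original experiments.

```python
from typing import NamedTuple

# Transition patterns named in the text: rising, falling, flat.
PATTERNS = {"U": "rising", "D": "falling", "F": "flat"}


class SPLabel(NamedTuple):
    """Decomposed SP-HMM label (hypothetical record type)."""
    syllable: str  # the syllable being modeled
    context: str   # the neighbouring syllable inside the same digit
    side: str      # "left" for LC-SYL, "right" for SYL+RC
    pattern: str   # "U", "D" or "F"


def parse_sp_label(label: str) -> SPLabel:
    """Split a label such as "i+chi,U" into its parts."""
    name, _, pattern = label.partition(",")
    name, pattern = name.strip(), pattern.strip()
    if pattern not in PATTERNS:
        raise ValueError(f"unknown transition pattern in {label!r}")
    if "+" in name:
        # SYL+RC: syllable followed by its right context.
        syllable, context = name.split("+", 1)
        return SPLabel(syllable, context, "right", pattern)
    if "-" in name:
        # LC-SYL: left context followed by the syllable.
        context, syllable = name.split("-", 1)
        return SPLabel(syllable, context, "left", pattern)
    raise ValueError(f"no context marker in {label!r}")


first = parse_sp_label("i+chi,U")   # label taken from the text
second = parse_sp_label("i-chi,D")  # hypothetical left-context label
print(first, PATTERNS[first.pattern])
print(second, PATTERNS[second.pattern])
```

Under this reading, a digit is a pair of labels, one right-context dependent and one left-context dependent, each carrying its own transition pattern.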

3.3.2 Multi-stream Modeling. SP-HMMs are modeled as multi-stream HMMs. In the recognition stage, the probability $b_j(o_{SP})$ of generating segmental-prosodic observation $o_{SP}$ at state $j$ is calculated by:

$$b_j(o_{SP}) = b_j^{S}(o_S)^{\lambda_S} \, b_j^{P}(o_P)^{\lambda_P}$$

where $b_j^{S}(o_S)$ is the probability of generating segmental features $o_S$, $b_j^{P}(o_P)$ is the probability of generating prosodic features $o_P$, and $\lambda_S$ and $\lambda_P$ are weighting factors for the segmental and prosodic streams, respectively. They are constrained by $\lambda_S + \lambda_P = 1$.
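In practice a weighted product of stream probabilities is usually evaluated in the log domain, where it becomes a weighted sum. The following is a minimal sketch, assuming the common multi-stream form in which each stream probability is raised to its weight and the two weights sum to one. The function name and the numeric values are illustrative assumptions, not taken from the original system.

```python
import math


def combine_stream_log_likelihoods(
    log_b_seg: float, log_b_pro: float, lambda_seg: float
) -> float:
    """Return the log of b_seg**lambda_seg * b_pro**lambda_pro.

    lambda_pro is derived from the assumed sum-to-one constraint,
    so only the segmental weight has to be supplied.
    """
    if not 0.0 <= lambda_seg <= 1.0:
        raise ValueError("lambda_seg must lie in [0, 1]")
    lambda_pro = 1.0 - lambda_seg
    return lambda_seg * log_b_seg + lambda_pro * log_b_pro


# Illustrative (made-up) state output probabilities for one frame.
log_b_seg = math.log(0.2)  # segmental stream
log_b_pro = math.log(0.5)  # prosodic stream

for lam in (1.0, 0.8, 0.6):
    score = combine_stream_log_likelihoods(log_b_seg, log_b_pro, lam)
    print(f"lambda_seg={lam:.1f}  combined log-probability={score:.4f}")
```

With a segmental weight of 1.0 the prosodic stream is ignored and the score reduces to the segmental log-probability. Lowering that weight lets the prosodic stream contribute more.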

3.3.3 Building SP-HMMs. Syllable HMMs for segmental and prosodic features are built separately and then combined into SP-HMMs using a tied-mixture technique, as follows:

1 “S-HMMs (Segmental HMMs)” are trained using only segmental features. They are denoted by either “LC-SYL” or “SYL+RC”. Here,
