Digital Signal Processing Reference
(wild card) means that HMMs are built without considering the
transitions, “U”, “D” or “F”. The total number of S-HMM states is the
same as the number of SP-HMM states. Twenty S-HMMs, including “sil” and
“sp”, are trained.
2
Training utterances are segmented into syllables by the forced-alignment
technique using the S-HMMs; then one of the transition labels, “U”, “D”
or “F”, is manually given to each segment according to its actual
pattern.
3
“P-HMMs (Prosodic HMMs)”, having a single state, are trained by prosodic
features within these segments, according to the transition label. Eight
separate models, “sil” and “sp”, are made. Each P-HMM has a single state,
since it has been found that syllabic contours in Japanese can be
approximated by a line function [4] and that the value can be expected
to be almost constant in each CV syllable.
4
The S-HMMs and P-HMMs are combined to make SP-HMMs. Gaussian mixtures
for the segmental feature stream of SP-HMMs are tied with corresponding
S-HMM mixtures, while the mixtures for the prosodic feature stream are
tied with corresponding P-HMM mixtures. Figure 9-3