Digital Signal Processing Reference
4. EXPERIMENTS
4.1 Database
A speech database was collected from 11 male speakers in clean, quiet
conditions. The database comprised utterances of 2-8 connected digits, with an
average of 5 digits per utterance. Each speaker uttered the digit strings,
separating each string with a silence period. For each speaker, 210 connected
digits and approximately 229 silence periods were collected.
Experiments were conducted using the leave-one-out method: data from one
speaker were used for testing while data from all other speakers were used
for training, and this process was repeated with each speaker held out in turn.
Accordingly, 11 speaker-independent experiments were conducted, and the mean
word accuracy was calculated as the measure of recognition performance. All the
HMMs were trained using only clean utterances, and the testing data were
contaminated with white, in-car, exhibition-hall, or elevator-hall noise at
three SNR levels: 5, 10, and 20 dB.
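
The protocol above can be sketched in a few lines. This is only an illustration, not the authors' tooling: the speaker identifiers, the function names, and the choice to measure signal power over the whole utterance are assumptions, since the text does not say how the SNR was computed.

```python
import math

def leave_one_speaker_out(speakers):
    """Yield (training_speakers, test_speaker) pairs, one fold per speaker."""
    for held_out in speakers:
        yield [s for s in speakers if s != held_out], held_out

def mean_power(samples):
    """Average power of a sequence of samples."""
    return sum(v * v for v in samples) / len(samples)

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the speech-to-noise power ratio is snr_db.

    Assumption: power is averaged over the whole utterance, silence included.
    """
    # Repeat the noise if it is shorter than the speech, then cut to length.
    reps = -(-len(speech) // len(noise))  # ceiling division
    noise = (list(noise) * reps)[:len(speech)]
    gain = math.sqrt(mean_power(speech) / (mean_power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]

# Hypothetical speaker labels; the source only states that there were 11 speakers.
speakers = [f"spk{i:02d}" for i in range(1, 12)]
folds = list(leave_one_speaker_out(speakers))
```

Each of the 11 folds trains on 10 speakers and tests on the remaining one. Under this sketch, the reported figure would be the mean of the per-fold word accuracies.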
4.2 Dictionary and Grammar
In the recognition dictionary, each digit had three variants to account for the
transitions. For instance, variations of “1” comprised “ i+chi , U i-chi , U
sp ”, “ i+chi , D i-chi , D sp ”, and “ i+chi , F i-chi , F sp ”. This means
that the transition pattern was not allowed to change within each digit. The
recognition grammar was created so that all digits could be connected without
any restrictions.
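
As a rough sketch of how such dictionary entries could be generated, the snippet below attaches a single transition label to every unit of a digit, so the pattern cannot change within the digit. The function name and the `unit,label` output format are assumptions made for illustration. The unit names and the labels U, D, and F are copied from the example above, and their meaning is not restated here.

```python
# Transition labels as they appear in the example; semantics defined elsewhere.
TRANSITION_LABELS = ("U", "D", "F")

def digit_variants(units, labels=TRANSITION_LABELS, pause="sp"):
    """Return one pronunciation per transition label.

    Every unit in a variant carries the same label, and each variant ends
    with the short-pause model.
    """
    variants = []
    for label in labels:
        tagged = [f"{unit},{label}" for unit in units]
        variants.append(" ".join(tagged + [pause]))
    return variants

# The three variants of the digit "1" from the example.
variants_of_one = digit_variants(["i+chi", "i-chi"])
```

A variant that mixes labels, such as `i+chi,U i-chi,D sp`, is never produced, which matches the constraint stated above.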
4.3 Experimental Results
Training and testing were performed using the HTK [5]. In our preliminary
experiments, the best S-HMM recognition performance (“baseline”) was obtained
when the number of mixtures in each S-HMM was four. Experiments were then
conducted to select the optimum number of mixtures for the prosodic stream
(P-HMMs) in SP-HMMs tied with four-mixture S-HMMs, and the best performance
using SP-HMMs was obtained when four-mixture P-HMMs were used. Therefore, in
the experiments hereafter, SP-HMMs were tied with four-mixture S-HMMs and
four-mixture P-HMMs.
Table 9-2 shows the digit accuracy using SP-HMMs in various SNR conditions.
“SP-HMM-X” indicates the SP-HMMs using the prosodic feature “ P-X ”.
Accuracies are averaged over the four kinds of noise at each SNR level: 20, 10,
and 5 dB. The segmental and prosodic stream weights and insertion penalties