Digital Signal Processing Reference
In-Depth Information
detected line. Since the moving window spans nine frames, this method takes
temporal continuity over 90 ms into account.
In conventional F0 extraction methods, F0 values are extracted independently
at every frame and various smoothing techniques are applied afterwards. The
problem with these methods is that they are sensitive to a loss of accuracy
in the raw F0 values. Since our method uses the continuity of cepstral images,
it is expected to be more robust than conventional methods.
2.3 Evaluation of F0 Extraction
Utterances from two speakers, one male and one female, were selected from
the ATR continuous speech corpus to evaluate the proposed method. Each
speaker uttered 50 sentences. The corpus includes correct F0 labels assigned
manually. White noise, in-car noise, exhibition-hall noise, and elevator-hall
noise were added to these utterances at three SNR levels: 5, 10, and 20 dB.
Accordingly, 1,200 utterances (2 speakers × 50 sentences × 4 noise types ×
3 SNR levels) were prepared for evaluation.
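The noise-mixing step can be sketched as follows. This is a minimal
illustration, not the authors' procedure: the function name mix_at_snr is
hypothetical, and it assumes the SNR is defined from the average power of the
whole utterance, which the text does not state. The last lines only re-derive
the count of 1,200 test utterances from the conditions listed above.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mixture has the requested SNR in dB.

    Assumption: SNR is measured from average power over the whole utterance.
    Returns (mixture, gain), where gain is the factor applied to the noise.
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("speech and noise must have non-zero power")
    # SNR_dB = 10 * log10(p_speech / (gain^2 * p_noise)), solved for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    mixture = [s + gain * n for s, n in zip(speech, noise)]
    return mixture, gain

# Evaluation conditions as listed in the text.
SPEAKERS = ("male", "female")
SENTENCES_PER_SPEAKER = 50
NOISE_TYPES = ("white", "in-car", "exhibition-hall", "elevator-hall")
SNR_LEVELS_DB = (5, 10, 20)

n_conditions = (len(SPEAKERS) * SENTENCES_PER_SPEAKER
                * len(NOISE_TYPES) * len(SNR_LEVELS_DB))
```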
The correct extraction rate was defined as the ratio of the number of frames
in which the extracted F0 values were within ±5% of the correct F0 values to
the total number of labeled voiced frames.
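As a minimal sketch of this metric, assuming the extracted and correct values
are per-frame pitch (F0) estimates in Hz and that frames without a label are
marked by None or 0 in the reference (a convention chosen here, not taken from
the corpus), the rate could be computed as:

```python
def correct_extraction_rate(extracted_f0, reference_f0, tolerance=0.05):
    """Fraction of labeled frames whose extracted value is within
    +/- tolerance (relative) of the reference value.

    Assumption: a reference of None or 0 marks a frame with no label;
    such frames are excluded from the denominator.
    """
    if len(extracted_f0) != len(reference_f0):
        raise ValueError("sequences must have the same number of frames")
    labeled = 0
    correct = 0
    for est, ref in zip(extracted_f0, reference_f0):
        if ref is None or ref <= 0:
            continue  # frame has no reference label
        labeled += 1
        if est is not None and abs(est - ref) <= tolerance * ref:
            correct += 1
    if labeled == 0:
        raise ValueError("reference contains no labeled frames")
    return correct / labeled

# Three labeled frames; 104 and 195 are within 5%, 106 is not.
reference = [0, 100.0, 100.0, 200.0, 0]
extracted = [50.0, 104.0, 106.0, 195.0, 0]
rate = correct_extraction_rate(extracted, reference)
```

In this toy example two of the three labeled frames fall inside the
tolerance, so the rate is 2/3.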
Evaluation results showed that the correct extraction rate, averaged over all
noise conditions, improved by 11.2 percentage points, from 63.6% to 74.8%,
compared with the conventional method without smoothing.
3. INTEGRATION OF SEGMENTAL AND PROSODIC INFORMATION FOR NOISE ROBUST SPEECH RECOGNITION
3.1 Japanese Connected Digit Speech
The effectiveness of the F0 information extracted by the proposed method for
speech recognition was evaluated in a Japanese connected digit speech
recognition task. In Japanese connected digit speech, two or three digits often
form one prosodic phrase. Figure 9-1 shows an example of the F0 contour of
connected digit speech. The first two digits form the first prosodic phrase,
and the latter three digits form the second. The F0 transition is represented
by CV syllabic units, and each CV syllable can be prosodically labeled as a
“rising”, “falling”, or “flat” F0 part. Since this feature changes at digit
boundaries, using this information is expected to improve the accuracy of
digit alignment in the recognition process.
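The text does not specify how a syllable is assigned its “rising”, “falling”,
or “flat” label. One hypothetical rule, shown only to make the idea concrete,
is to fit a straight line to the log-scale pitch (F0) values of the frames in
a syllable and threshold its slope. The function name, the 10 ms frame shift
(consistent with nine frames spanning 90 ms), and the threshold of one octave
per second are all assumptions of this sketch.

```python
import math

def label_syllable(f0_hz, frame_shift_s=0.01, flat_threshold=1.0):
    """Label a syllable from the least-squares slope of log2(F0).

    f0_hz: positive F0 values (Hz) for the frames of one CV syllable.
    flat_threshold: slope magnitude (octaves per second) at or below
    which the syllable is called "flat" (hypothetical value).
    """
    n = len(f0_hz)
    if n < 2 or any(v <= 0 for v in f0_hz):
        raise ValueError("need at least two positive F0 values")
    times = [i * frame_shift_s for i in range(n)]
    log_f0 = [math.log2(v) for v in f0_hz]
    mean_t = sum(times) / n
    mean_y = sum(log_f0) / n
    # Ordinary least-squares slope of log2(F0) against time.
    num = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, log_f0))
    den = sum((t - mean_t) ** 2 for t in times)
    slope = num / den
    if slope > flat_threshold:
        return "rising"
    if slope < -flat_threshold:
        return "falling"
    return "flat"

labels = [
    label_syllable([100, 110, 120, 130, 140]),
    label_syllable([140, 130, 120, 110, 100]),
    label_syllable([120, 120, 121, 120, 120]),
]
```

A change in label between neighboring syllables could then serve as a cue for
a possible digit boundary, which is the kind of information the alignment is
expected to benefit from.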