NOISE ROBUST SPEECH RECOGNITION USING PROSODIC INFORMATION - DSP for In-Vehicle and Mobile Systems


Figure 9-1. An example of the F0 contour of Japanese connected digit speech.

3.2 Integration of Segmental and Prosodic Features

Each segmental feature vector has 25 elements: 12 MFCCs, their deltas, and the delta log energy. The window length is 25 ms and the frame interval is 10 ms. Cepstral mean subtraction (CMS) is applied to each utterance.
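
The 25-element layout described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the 12 MFCCs and the log energy per frame have already been computed elsewhere (25 ms window, 10 ms shift), and it uses a simple central difference for the deltas, since the text does not state which delta formula was used. The function names are hypothetical.

```python
import numpy as np

def delta(x):
    """Frame-to-frame difference along time (axis 0).

    Assumption: a central difference stands in for the delta computation,
    which the text does not specify.
    """
    return np.gradient(x, axis=0)

def segmental_features(mfcc, log_energy):
    """Build (T, 25) vectors from mfcc of shape (T, 12) and log_energy of shape (T,)."""
    mfcc = np.asarray(mfcc, dtype=float)
    log_energy = np.asarray(log_energy, dtype=float)
    # Cepstral mean subtraction: remove the per-utterance mean of each coefficient.
    mfcc_cms = mfcc - mfcc.mean(axis=0, keepdims=True)
    d_mfcc = delta(mfcc_cms)               # 12 delta MFCCs
    d_energy = delta(log_energy)[:, None]  # 1 delta log energy
    return np.hstack([mfcc_cms, d_mfcc, d_energy])  # 12 + 12 + 1 = 25
```

Subtracting a constant mean does not change a difference between frames, so in this sketch CMS only alters the 12 static coefficients; the delta terms come out the same with or without it.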

Two prosodic features are computed: one is the ΔlogF0 value, which represents the F0 transition, and the other is the maximum accumulated voting value obtained in the Hough transform, which indicates the degree of temporal continuity in the F0 contour.

The ΔlogF0 value is calculated as follows:

ΔlogF0 = d(log F0)/dt = (dF0/dt) / F0

where dF0/dt is directly computed from the line extracted by the Hough transform.
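
To make the two prosodic features concrete, here is a simplified sketch. It assumes the contour in question is an F0 track and that the transition feature is the time derivative of log F0, so that by the chain rule it equals (dF0/dt) / F0. The line search below is a slope-intercept variant of the Hough transform over a small time-by-bin window, chosen for brevity; the function names, the window layout, and the candidate slope grid are all hypothetical and are not taken from the chapter.

```python
import numpy as np

def hough_line_window(window, slopes):
    """Find the straight line with the largest accumulated vote in a window.

    window: (n_frames, n_bins) array of non-negative salience values.
    slopes: candidate slopes in bins per frame.
    Returns (best_slope, center_bin, max_vote), where center_bin is the
    bin the line passes through at the middle of the window.
    """
    n_frames, n_bins = window.shape
    frames = np.arange(n_frames)
    t = frames - (n_frames - 1) / 2.0  # time offsets from the window center
    best = (0.0, 0, -np.inf)
    for m in slopes:
        for c in range(n_bins):
            bins = np.rint(c + m * t).astype(int)
            valid = (bins >= 0) & (bins < n_bins)
            # Accumulate the votes of all cells lying on this candidate line.
            vote = window[frames[valid], bins[valid]].sum()
            if vote > best[2]:
                best = (float(m), c, float(vote))
    return best

def delta_log_f0(df0_dt, f0):
    """d(log F0)/dt = (dF0/dt) / F0, with the slope taken from the fitted line."""
    return df0_dt / f0
```

A window containing one clean rising line yields a high maximum vote and a stable slope, while a window of scattered noise yields a low maximum vote, which is the behavior the text attributes to voiced versus unvoiced or pause periods.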

An example of the time functions of the ΔlogF0 and maximum accumulated voting values is shown in Figure 9-2, for a male speaker's utterance, “9053308” “3797298”, with white noise added at 20 dB SNR. In unvoiced and pause periods, the ΔlogF0 value fluctuates more than in voiced periods, and the maximum accumulated voting values are much smaller than those in voiced periods. These features are therefore expected to be effective for detecting boundaries between voiced and unvoiced/pause periods.
