Digital Signal Processing Reference
Figure 9-1. An example of the F0 contour of Japanese connected digit speech.
3.2 Integration of Segmental and Prosodic Features
Each segmental feature vector has 25 elements: 12 MFCCs, their 12 deltas,
and the delta log energy. The window length is 25 ms and the frame
interval is 10 ms. Cepstral mean subtraction (CMS) is applied to each utterance.
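
As a rough illustration of how such a 25-element vector might be assembled, the Python sketch below assumes that the 12 MFCCs and the log energy of every frame are already available. The symmetric two-frame difference used for the deltas, the repeated edge frames, and all function names are assumptions made for illustration; the text does not specify them.

```python
from typing import List

Frame = List[float]


def frame_count(num_samples: int, sample_rate: int,
                window_ms: float = 25.0, shift_ms: float = 10.0) -> int:
    """Number of full analysis windows that fit in the signal."""
    window = int(sample_rate * window_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // shift


def cepstral_mean_subtraction(cepstra: List[Frame]) -> List[Frame]:
    """Subtract the per-utterance mean from every cepstral coefficient."""
    n = len(cepstra)
    dims = len(cepstra[0])
    mean = [sum(f[d] for f in cepstra) / n for d in range(dims)]
    return [[f[d] - mean[d] for d in range(dims)] for f in cepstra]


def deltas(track: List[Frame]) -> List[Frame]:
    """Symmetric first difference; edge frames are repeated (an assumption)."""
    n = len(track)
    out = []
    for t in range(n):
        prev = track[max(t - 1, 0)]
        nxt = track[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def segmental_features(mfcc: List[Frame], log_energy: List[float]) -> List[Frame]:
    """Build 25-element vectors: 12 MFCC + 12 delta MFCC + delta log energy."""
    static = cepstral_mean_subtraction(mfcc)
    d_mfcc = deltas(static)
    d_energy = deltas([[e] for e in log_energy])
    return [s + d + de for s, d, de in zip(static, d_mfcc, d_energy)]
```

With a 25 ms window and a 10 ms frame interval, and an assumed 16 kHz sampling rate, `frame_count(16000, 16000)` gives 98 full frames for one second of audio.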
Two prosodic features are computed. The first is the Δ log F0 value, which
represents the F0 transition. The second is the maximum accumulated voting
value obtained in the Hough transform, which indicates the degree of temporal
continuity of F0. The Δ log F0 value is computed directly from the line
extracted by the Hough transform.
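
The idea of reading both prosodic features off a Hough transform can be sketched as follows. This is a minimal illustration under stated assumptions, not the method used in the chapter: it votes with unit weight on per-frame log-F0 candidates, uses a slope-intercept parameterization with a hypothetical slope grid, and takes the slope of the winning line as the Δ log F0 value. The function name, the grid, and the cell size are all hypothetical.

```python
from typing import Dict, List, Tuple


def hough_prosodic_features(
    points: List[Tuple[int, float]],
    slopes: List[float],
    intercept_step: float = 0.05,
) -> Tuple[float, int]:
    """Return (slope of the strongest line, maximum accumulated votes).

    points: (frame index, log-F0 candidate) pairs inside one analysis window.
    slopes: candidate slopes in log-F0 per frame (hypothetical grid).
    intercept_step: width of one accumulator cell along the intercept axis.
    """
    best_slope, best_votes = 0.0, 0
    for slope in slopes:
        accumulator: Dict[int, int] = {}
        for t, y in points:
            # Each point votes for the intercept cell of the line with this
            # slope that passes through it.
            cell = round((y - slope * t) / intercept_step)
            accumulator[cell] = accumulator.get(cell, 0) + 1
        votes = max(accumulator.values(), default=0)
        if votes > best_votes:
            best_slope, best_votes = slope, votes
    return best_slope, best_votes


slope_grid = [-0.04, -0.02, 0.0, 0.02, 0.04]

# Points lying on one line (a smooth, voiced-like contour).
voiced = [(t, 5.0 + 0.02 * t) for t in range(9)]
# Scattered points (an unvoiced-like or pause-like stretch).
scattered = list(enumerate([4.2, 5.9, 4.8, 5.5, 4.0, 5.1, 4.5, 5.8, 4.3]))

voiced_slope, voiced_votes = hough_prosodic_features(voiced, slope_grid)
_, scattered_votes = hough_prosodic_features(scattered, slope_grid)
```

In this toy setup all nine collinear points vote for the same cell, while the scattered points cannot, so the maximum vote count separates the two cases.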
An example of the time functions of the Δ log F0 and maximum accumulated
voting values is shown in Figure 9-2 for a male speaker's utterance, “9053308”
“3797298”, with white noise added at 20 dB SNR. In unvoiced and pause periods,
the Δ log F0 value fluctuates more than in voiced periods, and the maximum
accumulated voting values are much smaller than those in voiced periods.
These features are therefore expected to be effective for detecting
boundaries between voiced and unvoiced/pause periods.