Digital Signal Processing Reference
Fig. 7.2 Normalized histogram of the start of utterance (SOU) with respect to the beginning of the speech file; the thick black line marks the median of 0.83 s [3]. Horizontal axis: start of utterance [s], 0 to 5; vertical axis: 0 to 0.25.
For reference, we performed a similar experiment in which the TAP system is
replaced by the state of the art: upon PTS button actuation, the in-car audio system
is muted (i.e., no echo component is added at the microphone), and the unprocessed
microphone signal is passed to the ASR engine. Any speech parts preceding
the PTS event are discarded because there are no look-back buffers.
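The baseline behavior described above can be illustrated with a minimal sketch. It assumes the microphone signal is available as a plain sequence of samples; the function name `baseline_asr_input` and the sample rate used in the usage line are hypothetical and do not come from the text.

```python
def baseline_asr_input(mic_signal, pts_time_s, sample_rate_hz):
    """Return the part of the microphone signal that reaches the ASR engine.

    Without look-back buffers, every sample recorded before the PTS event
    is lost, so the signal is simply cut at the PTS sample index.
    """
    # Convert the PTS time to a sample index (never negative).
    pts_index = max(0, int(round(pts_time_s * sample_rate_hz)))
    return mic_signal[pts_index:]


# Usage with an assumed 8 kHz sample rate (illustrative value only):
# two seconds of audio, PTS at 0.83 s.
remaining = baseline_asr_input([0.0] * 16000, pts_time_s=0.83, sample_rate_hz=8000)
```

If the speaker starts talking before the PTS event, the leading part of the utterance falls inside the discarded region, so the recognizer never sees it.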
7.5.1 Test Speech Data
The test speech data consisted of a subset of the US-English SpeechDat-Car
connected-digit corpus [7]. The set comprised 210 utterances spoken by 35
speakers, each utterance containing four to sixteen digits. Since the test files were
artificially degraded with background noise (see next section), we used close-talk
recordings only, which approximately represent clean speech.
As described in [3], PTS actuation was assumed to occur 0.83 s after the
beginning of each test speech file. Since the actual time of the speech onset varied
from file to file, a probabilistic displacement of the SOU with respect to the PTS
event was achieved. The histogram in Fig. 7.2, which was generated by forced
Viterbi alignment, visualizes the distribution of the speech onset we found in the
test speech files. By assuming the PTS event at the median of the SOUs, both
premature and delayed speech were simulated.
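As a rough illustration of why placing the PTS event at the median yields both cases, the sketch below computes the displacement of each speech onset relative to the PTS time. The SOU values in the usage lines are invented for illustration and are not the measured onsets from the test files; the function names are hypothetical.

```python
from statistics import median


def displacement_to_pts(sou_times_s):
    """Place the PTS event at the median SOU and return per-file offsets.

    A negative offset means the speech started before the PTS event
    (premature speech); a positive offset means it started afterwards
    (delayed speech).
    """
    pts_time_s = median(sou_times_s)
    offsets_s = [sou - pts_time_s for sou in sou_times_s]
    return pts_time_s, offsets_s


def count_premature_and_delayed(offsets_s):
    """Count files whose speech onset precedes or follows the PTS event."""
    premature = sum(1 for d in offsets_s if d < 0)
    delayed = sum(1 for d in offsets_s if d > 0)
    return premature, delayed


# Illustrative onsets in seconds (assumed values, not data from the corpus).
example_sous = [0.2, 0.5, 0.83, 1.4, 3.0]
pts_time, offsets = displacement_to_pts(example_sous)
n_premature, n_delayed = count_premature_and_delayed(offsets)
```

Because the median splits the onsets into two halves, roughly as many files start before the PTS event as after it.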
7.5.2 Artificial Degradation with Echo and Noise
We used different loudspeaker source signals to excite the LEM system as well as a set
of vehicle noise files to simulate the disturbance of the desired speech on the
microphone. Two different simulations were performed: In one case, the loudspeaker
 