Digital Signal Processing Reference
In-Depth Information
4.2 VAD under different noise conditions
A prerequisite for good noise suppression is robust voice activity
detection (VAD) that adapts to varying background noise conditions. The
energy-threshold-based VAD approach in the time domain (TVAD), shown in
Fig. 12-3, has very robust adaptation characteristics. On the other hand,
the approach has shown limited performance on weak fricative sounds at word
beginnings in the presence of loud, stationary background noise whose
energy distribution is very similar to that of the fricatives.
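The idea of an energy threshold that follows the background noise can be sketched roughly as follows. This is only an illustration, not the TVAD of Fig. 12-3: the function names, the 6 dB margin, the smoothing factor alpha, and the assumption that the first frame contains only noise are all hypothetical choices.

```python
import math

def frame_energies_db(samples, frame_len):
    """Short-time energy of non-overlapping frames, in dB."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        energies.append(10.0 * math.log10(power + 1e-12))  # avoid log(0)
    return energies

def tvad(energies_db, margin_db=6.0, alpha=0.9):
    """Hypothetical energy-threshold VAD with an adaptive noise floor.

    A frame counts as speech when its energy exceeds the tracked noise
    floor by margin_db. The floor is updated only during pauses, so it
    can follow slowly varying background noise.
    """
    noise_db = energies_db[0]  # assumption: the first frame is noise only
    decisions = []
    for e in energies_db:
        is_speech = e > noise_db + margin_db
        if not is_speech:
            noise_db = alpha * noise_db + (1.0 - alpha) * e
        decisions.append(is_speech)
    return decisions
```

A sketch like this also makes the reported weakness plausible: a weak fricative whose energy lies within the margin above a loud noise floor is classified as a pause.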
Therefore, a VAD was implemented in the frequency domain (FVAD) that
combined several features (delta energy, smoothed sum of energy, and
peak-to-average ratio) from the ETSI approach [7] into a voice activity
decision. This frequency-domain approach performed very well under clean
conditions, but it proved sensitive to background noise.
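To make the three named features more concrete, the sketch below computes one plausible version of each per frame. The exact definitions in the ETSI approach [7] are not given here, so the formulas, the smoothing constant, and the function names are assumptions, and the step that combines the features into a decision is left out.

```python
import cmath
import math

def power_spectrum(frame):
    """Naive DFT power spectrum (bins 0..N/2); adequate for short frames."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spec.append(abs(acc) ** 2 / n)
    return spec

def fvad_features(frames, smooth=0.7):
    """Per-frame features; these definitions are illustrative assumptions."""
    feats = []
    prev_energy = None
    smoothed = None
    for frame in frames:
        spec = power_spectrum(frame)
        energy = sum(spec)
        # Delta energy: change of spectral energy from the previous frame.
        delta = 0.0 if prev_energy is None else energy - prev_energy
        # Smoothed sum of energy: first-order recursive average.
        if smoothed is None:
            smoothed = energy
        else:
            smoothed = smooth * smoothed + (1.0 - smooth) * energy
        # Peak-to-average ratio: strongest bin relative to the mean bin.
        mean = energy / len(spec)
        par = max(spec) / mean if mean > 0 else 0.0
        feats.append({"delta_energy": delta,
                      "smoothed_energy": smoothed,
                      "peak_to_average": par})
        prev_energy = energy
    return feats
```

With these definitions, a frame dominated by one spectral peak gives a high peak-to-average ratio, while a frame with a flat spectrum gives a ratio close to 1.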
To gain the advantages of both, a combined approach of a time-domain and
a frequency-domain VAD was tested, as shown in Fig. 12-5. The TVAD was used
to provide pause boundaries for robust noise adaptation. After the noise
reduction, the SNR was increased, and the cleaned speech signal provided
stable input conditions for the FVAD. The FVAD was then used to refine and
confirm the pause decisions of the TVAD.
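The data flow of the combined approach might be sketched as below. All four callables are hypothetical stand-ins, and the fusion rule (a frame remains a pause only if both stages agree) is an assumption, since the text does not spell out how the FVAD refines the TVAD decisions. The toy example represents each frame by a single energy value.

```python
def combined_vad(frames, tvad, estimate_noise, denoise, fvad):
    """Two-stage VAD sketch; every callable is a hypothetical stand-in."""
    coarse = tvad(frames)  # True = speech, False = pause
    pauses = [f for f, speech in zip(frames, coarse) if not speech]
    noise = estimate_noise(pauses)  # adapt noise estimate on TVAD pauses
    cleaned = [denoise(f, noise) for f in frames]
    fine = fvad(cleaned)  # second decision on the cleaned signal
    # Assumed fusion rule: a pause is confirmed only if both stages agree.
    return [c or f for c, f in zip(coarse, fine)]

# Toy stand-ins operating on one energy value per frame.
def toy_tvad(frames):
    return [e > 10.0 for e in frames]

def toy_estimate_noise(pauses):
    return sum(pauses) / len(pauses) if pauses else 0.0

def toy_denoise(frame, noise):
    return max(frame - noise, 0.0)

def toy_fvad(frames):
    return [e > 2.0 for e in frames]

energies = [1.0, 1.2, 4.0, 20.0, 18.0, 1.1]
result = combined_vad(energies, toy_tvad, toy_estimate_noise,
                      toy_denoise, toy_fvad)
```

In this toy run, the third frame (a weak onset) is missed by the first stage but recovered by the second stage once the noise estimate has been subtracted.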
The performance of the VAD approaches was evaluated on a database of
20 command phrases spoken by 13 subjects with varying background noise.
There were three noise conditions: silence, medium-level fan noise, and
high-level fan noise (27, 14, and 5 dB SNR, respectively). The performance
of the VADs was measured as the difference between manually and
automatically determined word boundaries. The results for the three noise
conditions for the TVAD only, the FVAD only, and the combined approach are
given in Fig. 12-6.
In Fig. 12-6, the absolute deviations between the automatically and
manually derived word boundaries (in ms) are divided into the categories
given below the figures. The category “No Labs” means that no boundary was
found automatically. The first three categories represent acceptable VAD
performance.
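The evaluation measure can be illustrated with a short sketch that bins the absolute boundary deviation into categories. The bin edges below are hypothetical placeholders, because the actual categories are the ones printed below Fig. 12-6; the function names are likewise assumptions.

```python
from bisect import bisect_left

# Hypothetical upper bin edges in ms; the real categories appear in Fig. 12-6.
EDGES_MS = [20, 40, 60, 100]

def categorize(manual_ms, auto_ms, edges=EDGES_MS):
    """Assign one word boundary to a deviation category."""
    if auto_ms is None:
        return "No Labs"  # no boundary was found automatically
    dev = abs(auto_ms - manual_ms)
    i = bisect_left(edges, dev)  # first edge that is >= dev
    if i == len(edges):
        return f">{edges[-1]} ms"
    return f"<={edges[i]} ms"

def histogram(pairs, edges=EDGES_MS):
    """Count categories over (manual_ms, auto_ms) boundary pairs."""
    counts = {}
    for manual, auto in pairs:
        label = categorize(manual, auto, edges)
        counts[label] = counts.get(label, 0) + 1
    return counts
```

Running histogram once per noise condition and per VAD variant would give counts of the kind that a figure like Fig. 12-6 summarizes.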
4.3 Recognition accuracy in the presence of noise
A total of 12 experiments with 37 different speakers and a test vocabulary
of 17 command phrases were carried out. Tables 12-1, 12-2, and 12-3 show
the word recognition rates (WRR) for 3 different environmental conditions
and 4 speaker-microphone distances (SMD). For each experiment, 111
realizations per command phrase were tested (17*111=1887