Digital Signal Processing Reference
As Figure 2-7 shows, system S1 achieved the lowest WER (3.01%)
because its models were perfectly matched to the acoustic condition during
decoding. S2 achieved a WER of 3.2% using 2 CPUs (1 CPU for digit
recognition, 1 CPU for sniffing acoustic conditions), which was close to the
expected value of 3.23%. (Note: Figure 2-7 plots system S2 with 2
CPUs even though only 1 ASR engine was used.) S3 achieved a WER of
3.6% using 8 CPUs. Comparing S2 with S3, S2 yields a relative
11.1% WER improvement while requiring a relative 75%
reduction in CPU resources. These results confirm the advantage of using
Environmental Sniffing over an ASR ROVER paradigm.
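The relative figures quoted for S2 versus S3 follow directly from the reported WERs and CPU counts. The short sketch below reproduces that arithmetic; the helper name relative_reduction is ours, introduced only for illustration.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Return the reduction of `new` relative to `baseline`, in percent."""
    return 100.0 * (baseline - new) / baseline


# Values reported in the text: S3 (ROVER) is the baseline, S2 uses sniffing.
wer_gain = relative_reduction(3.6, 3.2)   # WER in percent
cpu_saving = relative_reduction(8, 2)     # number of CPUs

print(f"Relative WER improvement: {wer_gain:.1f}%")
print(f"Relative CPU reduction:   {cpu_saving:.1f}%")
```

Both quantities are measured against S3, so the 11.1% is a relative improvement (0.4 points out of 3.6), not an absolute difference in WER.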
There are two critical points to consider when integrating Environmental
Sniffing into a speech task. The first, and most important, is to set up a
configuration such as S1, where prior noise knowledge can be fully used to
yield the lowest WER (i.e., the matched noise scenario). This requires
understanding the sources of error and finding specific solutions, assuming
that prior acoustic knowledge is available. For example, one must know which
speech enhancement scheme or model adaptation scheme is best for a specific
acoustic condition. Second, a reliable cost matrix should be
provided to Environmental Sniffing so that the subsequent speech task can
calculate the expected performance and make an informed adjustment in the
trade-off between performance and computation. For our experiments, we
considered evaluation results for Environmental Sniffing where it is employed
to find the most probable acoustic condition so that the correct
condition-dependent acoustic model could be used. This is most appropriate for
the goal of determining a single solution for the speech task at hand. If the
expected performance for the system employing Environmental Sniffing is
lower than the performance of a ROVER system, it may be useful to find the
n most probable acoustic condition types among the N acoustic conditions. In
the worst case, the acoustic condition knowledge extracted from Environmental
Sniffing can be ignored, and the system reduces to the traditional
ROVER solution. The goal of this section has therefore been to emphasize
that direct estimation of environmental conditions should provide important
information for tailoring a more effective solution for robust speech
recognition systems.
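The role of the cost matrix can be illustrated with a minimal sketch. It assumes, purely for illustration, that the expected WER is the sum over true conditions i and sniffed conditions j of prior(i) * P(sniffed j | true i) * cost[i][j], where cost[i][j] is the WER obtained when condition i is decoded with the model chosen for condition j. This formulation, the function names, and every number below are hypothetical and are not taken from the experiments above.

```python
def expected_wer(priors, confusion, cost):
    """Expected WER under an assumed prior, sniffing confusion matrix, and cost matrix.

    priors[i]       : probability that the true acoustic condition is i
    confusion[i][j] : probability that sniffing reports j when the truth is i
    cost[i][j]      : WER (%) when condition i is decoded with the model for j
    """
    total = 0.0
    for i, p_i in enumerate(priors):
        for j, p_ij in enumerate(confusion[i]):
            total += p_i * p_ij * cost[i][j]
    return total


def n_most_probable(posteriors, n):
    """Return the n condition labels with the highest sniffing posterior."""
    return sorted(posteriors, key=posteriors.get, reverse=True)[:n]


# Hypothetical two-condition setup (illustrative values only).
priors = [0.5, 0.5]
confusion = [[0.9, 0.1],
             [0.2, 0.8]]
cost = [[3.0, 6.0],
        [7.0, 4.0]]
rover_wer = 4.2  # hypothetical WER of a full ROVER system

sniffing_wer = expected_wer(priors, confusion, cost)
use_sniffing = sniffing_wer <= rover_wer

# If sniffing alone is not expected to beat ROVER, keep only the n most
# probable conditions instead of committing to a single one.
posteriors = {"cond_a": 0.6, "cond_b": 0.1, "cond_c": 0.3}
shortlist = n_most_probable(posteriors, 2)

print(f"Expected WER with sniffing: {sniffing_wer:.2f}%")
print("Use single best condition:", use_sniffing)
print("Shortlist of conditions:", shortlist)
```

Setting n equal to N in n_most_probable corresponds to ignoring the sniffing output altogether, which is the worst-case fallback to the traditional ROVER solution described in the text.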