transduction process, which is necessarily different for each of these cognitive modalities. Readers
are expected to have a solid understanding of traditional speech signal processing and speech
recognition.
3.4.1 Representation of Multi-Source Soundstreams
Figure 3.5 illustrates an "audio front end" for transduction of a soundstream into a string of "multisymbols," with the goal of carrying out ultra-high-accuracy speech transcription for a single speaker embedded in multiple interfering sound sources (often including other speakers). The description of
this design does not concern itself with computational efficiency. Given a concrete design for such a
system, there are many well-known signal processing techniques for implementing approximately
the same function, often orders of magnitude more efficiently. For the purpose of this introductory
treatment (which, again, is aimed at illustrating the universality of confabulation as the mechanization of cognition), this audio front-end design does not incorporate embellishments such as
binaural audio imaging.
Referring to Figure 3.5, the first step in processing is analog speech lowpass filtering (say, with a
flat, zero-phase-distortion response from DC to 4 kHz, with a steep rolloff thereafter) of the high-
quality (say, over 110 dB dynamic range) analog microphone input. Following this lowpass filtering, the microphone signal is sampled with an (e.g., 24-bit) analog-to-digital converter operating at a 16 kHz sample rate. The combination of high-quality analog filtering, a sufficient sample rate (well above the Nyquist rate of 8 kHz), and high dynamic range yields a digital output stream with almost
no artifacts (and low information loss). Note that digitizing to 24 bits supports exploitation of the
wide dynamic ranges of modern high-quality microphones. In other words, this dynamic range will
make it possible to accurately understand the speech of the attended speaker, even if there are much
higher amplitude interferers present in the soundstream.
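As a rough numerical check on the parameters quoted above (a sketch, not part of the original design description; the constant names below are illustrative assumptions), a 24-bit converter offers a theoretical dynamic range of about 144 dB, comfortably beyond the microphone's 110 dB, and a 16 kHz sample rate satisfies the Nyquist criterion for a 4 kHz bandwidth:

    import numpy as np

    # Illustrative digitization parameters from the text: 4 kHz analog lowpass
    # bandwidth, 16 kHz sample rate, signed 24-bit samples.
    SAMPLE_RATE_HZ = 16_000
    SIGNAL_BANDWIDTH_HZ = 4_000
    ADC_BITS = 24

    # Theoretical dynamic range of a 24-bit quantizer: 20*log10(2**24) ~ 144.5 dB,
    # well beyond the >110 dB microphone dynamic range, so quantization is not the
    # limiting factor when a loud interferer accompanies the attended speaker.
    adc_dynamic_range_db = 20 * np.log10(2.0 ** ADC_BITS)

    assert SAMPLE_RATE_HZ >= 2 * SIGNAL_BANDWIDTH_HZ  # Nyquist criterion satisfied
    print(f"ADC dynamic range: {adc_dynamic_range_db:.1f} dB")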
The 16 kHz stream of 24-bit signed integer samples generated by the above preprocessing (see
Figure 3.5) is next converted to floating point numbers and blocked up in time sequence into 8000-
sample windows (8000-dimensional floating point vectors), at a rate of one window for every
10 ms. Each such sound sample vector X thus overlaps the previous such vector by 98% of its length
(7840 samples). In other words, each X vector contains 160 new samples that were not in the
previous X vector (and the "oldest" 160 samples in that previous vector have "dropped off the left end").
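The blocking operation just described can be sketched in a few lines of Python (the function name and array layout are illustrative assumptions; the 8000-sample window and 160-sample hop follow the text):

    import numpy as np

    def frame_soundstream(samples, window=8_000, hop=160):
        """Block a 16 kHz sample stream into overlapping X vectors.

        Each row of the result is one 8000-dimensional X vector; consecutive rows
        overlap by window - hop = 7840 samples (98% of the window length), i.e.,
        each new vector brings in 160 fresh samples (10 ms at 16 kHz).
        """
        x = np.asarray(samples, dtype=np.float64)  # 24-bit integers -> floating point
        n_windows = 1 + (len(x) - window) // hop
        idx = np.arange(window) + hop * np.arange(n_windows)[:, None]
        return x[idx]

    # Illustrative usage on one second of synthetic 24-bit samples.
    rng = np.random.default_rng(0)
    stream = rng.integers(-2**23, 2**23, size=16_000)
    X = frame_soundstream(stream)
    print(X.shape)  # (51, 8000): one new window every 160 samples once the first is full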
Figure 3.5 An audio front-end for representation of a multi-source soundstream. See text for details.