transduction process, which is necessarily different for each of these cognitive modalities. Readers
are expected to have a solid understanding of traditional speech signal processing and speech
recognition.
3.4.1 Representation of Multi-Source Soundstreams
Figure 3.5 illustrates an "audio front end" for transduction of a soundstream into a string of "multisymbols," with the goal of carrying out ultra-high-accuracy speech transcription for a single speaker embedded in multiple interfering sound sources (often including other speakers). The description of
this design does not concern itself with computational efficiency. Given a concrete design for such a
system, there are many well-known signal processing techniques for implementing approximately
the same function, often orders of magnitude more efficiently. For the purpose of this introductory
treatment (which, again, is aimed at illustrating the universality of confabulation as the mechanization of cognition), this audio front-end design does not incorporate embellishments such as
binaural audio imaging.
Referring to Figure 3.5, the first step in processing is analog speech lowpass filtering (say, with a
flat, zero-phase-distortion response from DC to 4 kHz, with a steep rolloff thereafter) of the high-
quality (say, over 110 dB dynamic range) analog microphone input. Following this lowpass filtering, the microphone signal is sampled with an (e.g., 24-bit) analog-to-digital converter operating at a 16 kHz sample rate. The combination of high-quality analog filtering, a sufficient sample rate (well above the Nyquist rate of 8 kHz), and high dynamic range yields a digital output stream with almost
no artifacts (and low information loss). Note that digitizing to 24 bits supports exploitation of the
wide dynamic ranges of modern high-quality microphones. In other words, this dynamic range will
make it possible to accurately understand the speech of the attended speaker, even if there are much
higher amplitude interferers present in the soundstream.
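As a rough numerical check on the parameters quoted above (a sketch, not part of the original design description; the constant names below are illustrative assumptions), a 24-bit converter offers a theoretical dynamic range of about 144 dB, comfortably beyond the microphone's 110 dB, and a 16 kHz sample rate satisfies the Nyquist criterion for a 4 kHz bandwidth:

    import numpy as np

    # Illustrative digitization parameters from the text: 4 kHz analog lowpass
    # bandwidth, 16 kHz sample rate, signed 24-bit samples.
    SAMPLE_RATE_HZ = 16_000
    SIGNAL_BANDWIDTH_HZ = 4_000
    ADC_BITS = 24

    # Theoretical dynamic range of a 24-bit quantizer: 20*log10(2**24) ~ 144.5 dB,
    # well beyond the >110 dB microphone dynamic range, so quantization is not the
    # limiting factor when a loud interferer accompanies the attended speaker.
    adc_dynamic_range_db = 20 * np.log10(2.0 ** ADC_BITS)

    assert SAMPLE_RATE_HZ >= 2 * SIGNAL_BANDWIDTH_HZ  # Nyquist criterion satisfied
    print(f"ADC dynamic range: {adc_dynamic_range_db:.1f} dB")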
The 16 kHz stream of 24-bit signed integer samples generated by the above preprocessing (see
Figure 3.5) is next converted to floating point numbers and blocked up in time sequence into 8000-
sample windows (8000-dimensional floating point vectors), at a rate of one window for every
10 ms. Each such sound sample vector X thus overlaps the previous such vector by 98% of its length
(7840 samples). In other words, each X vector contains 160 new samples that were not in the
previous X vector (and the "oldest" 160 samples in that previous vector have "dropped off the left end").
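The blocking operation just described can be sketched in a few lines of Python (the function name and array layout are illustrative assumptions; the 8000-sample window and 160-sample hop follow the text):

    import numpy as np

    def frame_soundstream(samples, window=8_000, hop=160):
        """Block a 16 kHz sample stream into overlapping X vectors.

        Each row of the result is one 8000-dimensional X vector; consecutive rows
        overlap by window - hop = 7840 samples (98% of the window length), i.e.,
        each new vector brings in 160 fresh samples (10 ms at 16 kHz).
        """
        x = np.asarray(samples, dtype=np.float64)  # 24-bit integers -> floating point
        n_windows = 1 + (len(x) - window) // hop
        idx = np.arange(window) + hop * np.arange(n_windows)[:, None]
        return x[idx]

    # Illustrative usage on one second of synthetic 24-bit samples.
    rng = np.random.default_rng(0)
    stream = rng.integers(-2**23, 2**23, size=16_000)
    X = frame_soundstream(stream)
    print(X.shape)  # (51, 8000): one new window every 160 samples once the first is full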
Figure 3.5 An audio front-end for representation of a multi-source soundstream. See text for details.