Figure 3.6). Mathematically, the symbols of each primary sound lexicon are a vector quantizer
(Zador, 1963) for the set of S vectors that arise, from all sound sources that are likely to occur,
when each source is presented in isolation (i.e., no mixtures). Among the symbols that respond
to S are some that represent the sounds coming from the attended speaker. This illustrates the
critically important need to design the acoustic front-end so as to achieve this sort of
quasiorthogonalization of sources. By confining each sound feature to a properly selected time
interval (a subinterval of the 8000 samples available at each moment, ending at the most
recent 16 kHz sample), and by applying the proper postfiltering (after the dot product with the feature
vector has been computed), this quasiorthogonalization can be accomplished. (Note: This scheme
answers the question of how brains carry out "independent component analysis" [Hyvärinen et al.,
2001]. They don't need to. Properly designed quasiorthogonalizing features, adapted to the pure
sound sources that the critter encounters in the real world, map each source of an arbitrary mixture
of sources into its own separate components of the S vector. In effect, this is a sort of
"one-time ICA" feature development process carried out during development and then essentially
frozen, or perhaps adaptively maintained.) Given the stream of S vectors, the confabulation
processing that follows (as described below) can then, at each moment, ignore all but the subset of
components related to the attended source, independent of how many, or few, interfering sources
are present. Of course, this is exactly what is observed in mammalian audition: effortless
segmentation of the attended source at the very first stage of auditory (or visual or somatosensory,
etc.) perception.
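To make the front-end description above concrete, here is a minimal Python sketch of the idea. The names (postfilter, s_vector, feature_vectors, windows) and the half-wave-rectifying post-filter are illustrative assumptions: the text specifies only that each feature uses a properly selected time subinterval, a dot product, and a post-filter, so this is a sketch of the scheme, not the author's implementation.

```python
import numpy as np

# Illustrative sketch only: the names (postfilter, s_vector, feature_vectors,
# windows) and the rectifying post-filter are assumptions, not the book's design.

SAMPLE_RATE = 16_000   # 16 kHz input stream
BUFFER_LEN = 8_000     # samples available "at each moment" (the most recent 0.5 s)

def postfilter(x, threshold=0.0):
    # The text says only that "proper postfiltering" follows the dot product;
    # half-wave rectification is used here as a stand-in.
    return max(x - threshold, 0.0)

def s_vector(recent_samples, feature_vectors, windows):
    """Compute one S vector from the most recent BUFFER_LEN samples.

    recent_samples  : 1-D array of the last BUFFER_LEN samples (newest last)
    feature_vectors : list of 1-D arrays, one per primary-lexicon feature
    windows         : list of (start, end) index pairs selecting each feature's
                      time subinterval within recent_samples
    """
    s = np.zeros(len(feature_vectors))
    for i, (f, (start, end)) in enumerate(zip(feature_vectors, windows)):
        segment = recent_samples[start:end]            # properly selected interval
        s[i] = postfilter(float(np.dot(segment, f)))   # dot product, then post-filter
    return s
```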
The expectation formed on the next-word acoustic lexicon of Figure 3.7 (which is a huge
structure, almost surely implemented in the human brain by a number of physically separate
lexicons) is created by successive C1Fs. The first is based on input from the speaker model
lexicon. The only symbols (each representing a stored acoustic model for a single word — see
below) that then remain available for further use are those connected with the speaker currently
being attended to.
The second C1F is executed in connection with input from the language module word lexicon
that has an expectation on it representing possible predictions of the next word that the speaker will
produce (this next-word lexicon expectation is produced using essentially the same process as was
described in Section 3.3 in connection with sentence continuation with context). (Note: This is an
example of the situation mentioned above and in the Appendix, where an expectation is allowed to
transmit through a knowledge base.) After this operation, the only symbols left available for use on
the next-word acoustic lexicon are those representing expected words spoken by the attended
speaker. This expectation is then used for the processing involved in recognizing the attended
speaker's next word.
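As a toy illustration of how the two successive C1Fs narrow the set of symbols available on the next-word acoustic lexicon, the sketch below pictures each C1F as a simple set intersection. This is a deliberate simplification of the actual confabulation operation, and all symbol names are invented:

```python
# Toy picture only: each C1F is modeled as intersecting the surviving symbol set
# with the symbols that receive knowledge-link input; real C1Fs are confabulation
# operations, not bare set intersections.

def c1f(available, supported):
    """Keep only the lexicon symbols that remain available after this C1F."""
    return available & supported

all_word_models = {"red", "barn", "tractor", "silo", "ran", "sat"}

# First C1F: symbols tied to the acoustic word models of the attended speaker.
speaker_supported = {"red", "barn", "tractor", "silo", "ran"}

# Second C1F: the next-word expectation transmitted from the language-module
# word lexicon through a knowledge base (as in sentence continuation with context).
language_expectation = {"barn", "silo", "house"}

available = c1f(all_word_models, speaker_supported)
available = c1f(available, language_expectation)
print(sorted(available))   # ['barn', 'silo'] -> expected next words, this speaker
```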
As shown in Figure 3.7, knowledge bases have previously been established (using pure-source,
or well-segmented-source, examples) to and from the primary sound symbol lexicons with the
sound phrase lexicons and to and from these with the next-word acoustic lexicon. Using these
knowledge bases, the expectation on the next-word acoustic lexicon is transferred (as described
immediately above) via the appropriate knowledge bases to the sound phrase lexicons, where
expectations are formed; and from these to the primary sound lexicons, where additional
expectations are formed. It is easy to imagine that, since each of these transferred expectations is
typically much larger than the one from which it came, by the time this process gets to the primary
sound lexicons the expectations will encompass almost every symbol. THIS IS NOT SO! While these
primary lexicon expectations are indeed large (they may encompass many hundreds of symbols),
they are still only a small fraction of the total set of tens of thousands of symbols. Given these
transfers, which actually occur as soon as the recognition of the previous word is completed
(often long before its acoustic content ceases arriving), the architecture is prepared for
detecting the next word spoken by the attended speaker.
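A rough sketch of the expectation transfers described in this paragraph follows. Purely for illustration, a knowledge base is modeled here as a mapping from each source-lexicon symbol to the set of target-lexicon symbols it supports; all symbol names and counts are invented:

```python
# Hypothetical data layout: a knowledge base maps a source-lexicon symbol to the
# set of target-lexicon symbols it supports.

def transfer_expectation(expectation, knowledge_base):
    """Form the target-lexicon expectation supported by any source symbol."""
    target = set()
    for symbol in expectation:
        target |= knowledge_base.get(symbol, set())
    return target

# next-word acoustic lexicon -> sound phrase lexicons -> primary sound lexicons
word_to_phrase = {"barn": {"ph_ba", "ph_arn"}, "silo": {"ph_si", "ph_lo"}}
phrase_to_sound = {"ph_ba": {3, 17, 42}, "ph_arn": {5, 88},
                   "ph_si": {9, 12}, "ph_lo": {61}}

next_word_expectation = {"barn", "silo"}
phrase_expectation = transfer_expectation(next_word_expectation, word_to_phrase)
sound_expectation = transfer_expectation(phrase_expectation, phrase_to_sound)

# Each transfer fans out, yet even a primary-lexicon expectation of many hundreds
# of symbols is still a small fraction of a lexicon of tens of thousands of symbols.
print(len(phrase_expectation), len(sound_expectation))
```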