Robotics Reference
In-Depth Information
amounts of background noise, etc., but also the same speaker will often
say the same word at different pitches, speaking quickly on one occasion
and more slowly on another, sometimes with the effects of a sore throat
or a cold. All these variations, and many more, change how
the same spoken word is perceived on different occasions. So automatic
speech recognition is most certainly not an easy task, and explaining in
simple language how the technology of speech recognition works is almost
as difficult as the technology is complex. Of necessity, therefore,
the following explanation is a drastic simplification of how computers
recognize speech [16], and it describes only one of the ways in which the
recognition process works.
Let us imagine that each part of the particular word a system is trying
to recognize is always spoken at exactly the same speed. For example,
if we split up the word “elephant” into four parts: “el”, “e”, “ph” and
“ant”, under these idealised conditions the “el” would always be spoken
in exactly the same amount of time, the following “e” would also always
be spoken in exactly the same amount of time, and so on. By comparing
the waveform for each segment of the sound in “elephant” with a
database of stored segments, it would then be possible to identify each of
the sounds, and to string them together to recreate the whole word. In
practice such segments are in the region of one-fifth to one-quarter of a
second on average—too long for the comparison process to be effective.
Instead speech sounds are normally divided into much smaller segments,
typically one-hundredth to one-fiftieth of a second in duration. By making
the comparison with a larger number of shorter segments rather than
with a smaller number of longer segments, the recognition process
becomes much more accurate.
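The segment-and-compare idea can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the 8,000-samples-per-second rate, the two synthetic tones standing in for speech sounds, the function names, and the plain Euclidean distance are all invented for the example. Practical recognizers usually compare features derived from each segment rather than the raw waveform.

```python
import math

def split_into_frames(samples, sample_rate, frame_seconds=0.01):
    """Cut the samples into consecutive, non-overlapping short segments."""
    frame_len = int(round(sample_rate * frame_seconds))
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def distance(a, b):
    """Euclidean distance between two equal-length segments."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_label(frame, templates):
    """Return the label of the stored segment closest to this frame."""
    return min(templates, key=lambda label: distance(frame, templates[label]))

def tone(freq_hz, n_samples, sample_rate):
    """Synthetic stand-in for a speech sound: a pure sine tone."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate)
            for t in range(n_samples)]

SAMPLE_RATE = 8000   # assumed sampling rate, samples per second
FRAME_LEN = 80       # one-hundredth of a second at that rate

# Hypothetical "database of stored segments": one template per sound.
templates = {
    "low": tone(200, FRAME_LEN, SAMPLE_RATE),
    "high": tone(400, FRAME_LEN, SAMPLE_RATE),
}

# A made-up utterance: 20 ms of the low sound, then 20 ms of the high sound.
signal = tone(200, 160, SAMPLE_RATE) + tone(400, 160, SAMPLE_RATE)

frames = split_into_frames(signal, SAMPLE_RATE)
labels = [nearest_label(frame, templates) for frame in frames]
print(labels)  # ['low', 'low', 'high', 'high']
```

Each one-hundredth-of-a-second frame is labelled independently, so the 40 ms signal yields four labels instead of one.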
All speech is made up of strings of speech sounds, called allophones,
and each allophone is represented by a phoneme consisting of one or more
letters or symbols. Thus the sound of the letter “a” in “father” is represented
by the phoneme “aa”, the sound of the “u” in “cut” is represented
by the phoneme “ah”, and the sound of the “oo” in “book” is represented
by the phoneme “uh”. Most automatic speech recognition systems work
on either a word recognition basis or a phoneme recognition basis. If a
speech system can correctly recognize all of the phonemes in a word, and
in the correct order, then it has recognized the whole word.
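As a rough sketch of the last point, a recognized phoneme sequence can be looked up in a pronunciation table. The vowel symbols “aa”, “ah” and “uh” follow the text; the consonant symbols, the three-entry table, and the function name are assumptions made for illustration only.

```python
# Hypothetical pronunciation table: phoneme sequence -> word.
PRONUNCIATIONS = {
    ("f", "aa", "dh", "er"): "father",
    ("k", "ah", "t"): "cut",
    ("b", "uh", "k"): "book",
}

def word_from_phonemes(phonemes):
    """Return the word for an exact, in-order phoneme match, else None."""
    return PRONUNCIATIONS.get(tuple(phonemes))

print(word_from_phonemes(["k", "ah", "t"]))  # cut
print(word_from_phonemes(["t", "ah", "k"]))  # None: same phonemes, wrong order
```

The second call fails because the lookup is order-sensitive: the right phonemes in the wrong order do not amount to the word.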
[16] For a more detailed yet eminently readable account of speech recognition technologies, the
reader is referred to Robert Rodman's book Computer Speech Technology (Artech House, Boston,
1999).