How Computers Recognize - Robots Unlimited: Life in a Virtual Age

amounts of background noise, etc., but also the same speaker will often say the same word at different pitches, speaking quickly on one occasion and more slowly on another, sometimes with the effects of a sore throat or a cold. All these variations, and many more, change how the same spoken word is perceived on different occasions. So automatic speech recognition is most certainly not an easy task, and explaining in simple language how the technology works is almost as difficult as the technology is complex. Of necessity, therefore, the following explanation is a drastic simplification of how computers recognize speech,[16] and it describes only one of the ways in which the recognition process works.

Let us imagine that each part of the particular word a system is trying to recognize is always spoken at exactly the same speed. For example, if we split the word “elephant” into four parts, “el”, “e”, “ph” and “ant”, then under these idealised conditions the “el” would always be spoken in exactly the same amount of time, the following “e” would also always be spoken in exactly the same amount of time, and so on. By comparing the waveform for each segment of the sound in “elephant” with a database of stored segments, it would then be possible to identify each of the sounds and to string them together to recreate the whole word. In practice such segments last in the region of one-fifth to one-quarter of a second on average, which is too long for the comparison process to be effective. Instead, speech sounds are normally divided into much smaller segments, typically one-hundredth to one-fiftieth of a second in duration. By making the comparison with a larger number of shorter segments rather than with a smaller number of longer segments, the recognition process becomes much more accurate.
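
As a rough illustration of the short-segment comparison just described, the following Python sketch cuts a signal into frames of one-hundredth of a second and labels each frame with the nearest entry in a small store of reference segments. The function names, the rate of 8,000 samples per second, the Euclidean distance measure and the two sine-wave references are all assumptions made for this example, not details taken from the text. Practical recognizers generally compare features derived from each frame rather than raw sample values.

```python
import math


def split_into_frames(samples, sample_rate, frame_seconds=0.01):
    """Cut a list of samples into consecutive, non-overlapping frames.

    Any leftover samples that do not fill a whole frame are dropped.
    """
    frame_len = int(round(sample_rate * frame_seconds))
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def distance(frame_a, frame_b):
    """Euclidean distance between two frames of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)))


def closest_label(frame, stored_segments):
    """Return the label of the stored segment that is nearest to `frame`."""
    return min(stored_segments,
               key=lambda label: distance(frame, stored_segments[label]))


def label_frames(samples, sample_rate, stored_segments, frame_seconds=0.01):
    """Label every frame of the signal with its closest stored segment."""
    frames = split_into_frames(samples, sample_rate, frame_seconds)
    return [closest_label(frame, stored_segments) for frame in frames]


def sine_frame(frequency, sample_rate, length):
    """Hypothetical helper: one frame of a pure tone, used as toy data."""
    return [math.sin(2 * math.pi * frequency * n / sample_rate)
            for n in range(length)]


SAMPLE_RATE = 8000   # assumed sampling rate (samples per second)
FRAME_LENGTH = 80    # 8000 samples/s * 0.01 s = 80 samples per frame

# Toy "database of stored segments": two made-up reference sounds.
stored = {
    "low": sine_frame(200, SAMPLE_RATE, FRAME_LENGTH),
    "high": sine_frame(1000, SAMPLE_RATE, FRAME_LENGTH),
}

# A toy signal that is three frames long: low tone, high tone, low tone.
signal = stored["low"] + stored["high"] + stored["low"]
labels = label_frames(signal, SAMPLE_RATE, stored)  # ['low', 'high', 'low']
```

At this assumed sampling rate, a segment of one-fifth of a second would hold 1,600 samples, whereas a frame of one-hundredth of a second holds only 80, so the same stretch of speech yields twenty short comparisons in place of one long one.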

All speech is made up of strings of speech sounds, called allophones, and each allophone is represented by a phoneme consisting of one or more letters or symbols. Thus the sound of the letter “a” in “father” is represented by the phoneme “aa”, the sound of the “u” in “cut” is represented by the phoneme “ah”, and the sound of the “oo” in “book” is represented by the phoneme “uh”. Most automatic speech recognition systems work on either a word recognition basis or a phoneme recognition basis. If a speech system can correctly recognize all of the phonemes in a word, and in the correct order, then it has recognized the whole word.
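
The last point, that a word is recognized once all of its phonemes have been recognized in the correct order, can be sketched in Python as a lookup of the phoneme sequence in a pronunciation lexicon. The three-word lexicon and the function name below are hypothetical, and the consonant symbols (“f”, “dh”, “er”, “k”, “t”) are assumed for the example; only the vowel symbols “aa” and “ah” come from the text.

```python
# Hypothetical mini-lexicon mapping phoneme sequences to words.
# Vowel symbols "aa" and "ah" follow the text; consonants are assumed.
LEXICON = {
    ("f", "aa", "dh", "er"): "father",
    ("k", "ah", "t"): "cut",
    ("k", "aa", "t"): "cot",
}


def recognize_word(phonemes, lexicon=LEXICON):
    """Return the word whose phonemes match exactly and in order, else None."""
    return lexicon.get(tuple(phonemes))


word = recognize_word(["k", "ah", "t"])         # "cut"
other = recognize_word(["k", "aa", "t"])        # "cot": one phoneme differs
scrambled = recognize_word(["t", "ah", "k"])    # None: right phonemes, wrong order
```

An exact-match lookup like this fails as soon as a single phoneme is misrecognized, which is one reason the explanation in the text is described as a drastic simplification.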

[16] For a more detailed yet eminently readable account of speech recognition technologies, the reader is referred to Robert Rodman's book Computer Speech Technology (Artech House, Boston, 1999).
