way that humans use a fusion of audio and visual perception in deciding
what is being said. In the IBM project, the computer and camera locate
the person who is speaking by, for example, searching for skin-coloured
pixels and then using statistical models to detect any object in that
area which resembles a face. Then, with the speaker's face in view, vision
algorithms focus on the mouth region, estimating the location of many
features of the speaker's mouth, including the corners and center of the
lips.
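
To make the two-stage idea concrete, the sketch below (not IBM's actual code) uses the OpenCV library: it marks skin-coloured pixels, looks for a face-like region that overlaps them, and crops the lower part of that region as the mouth area. The colour thresholds, the Haar-cascade detector and the "lower third" mouth crop are all illustrative assumptions.

```python
# A rough sketch (not IBM's code) of the pipeline described above, using the
# OpenCV library. The YCrCb skin thresholds, the Haar-cascade face detector
# and the "lower third of the face" mouth crop are all illustrative choices.
import cv2

def locate_mouth_region(frame_bgr):
    # Stage 1: mark skin-coloured pixels with a simple colour-space threshold.
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    skin_mask = cv2.inRange(ycrcb, (0, 133, 77), (255, 173, 127))

    # Stage 2: run a statistical (here Haar-cascade) detector for face-like
    # objects, and keep only detections that mostly overlap skin pixels.
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in detector.detectMultiScale(gray, 1.1, 5):
        if skin_mask[y:y + h, x:x + w].mean() > 64:
            # Stage 3: crop the mouth area, roughly the lower third of the
            # face box, for the lip-feature algorithms to work on.
            return frame_bgr[y + 2 * h // 3:y + h, x:x + w]
    return None  # no plausible face found in this frame
```
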
If the camera looked solely at the mouth, only some 12 to 14 sounds
could be distinguished visually, for example the difference between the
“explosive” sound of a “p” at the start of a word and its close relative “b”.
So the visual region scanned by the cameras is enlarged to include many
types of movement, such as those of the jaw and the lower cheek, as well
as movements of the tongue and teeth. By combining these
visual features with the audio recognition data, it has proved possible to
increase the accuracy of speech recognition systems.
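
One simple way to picture this combination, sketched below under an assumed "early fusion" scheme rather than IBM's own feature set, is to concatenate the visual measurements for each short time window with the acoustic features for the same window and train a single statistical classifier on the joint vectors.

```python
# A minimal sketch of "early" audio-visual fusion, assumed here for
# illustration: the visual measurements for each short time window are
# concatenated with the acoustic features for the same window, and one
# statistical classifier is trained on the joint vectors. The dimensions
# are invented stand-ins, not IBM's feature set.
import numpy as np

def fuse_features(audio_features, visual_features):
    # One joint observation vector per time window.
    return np.concatenate([audio_features, visual_features])

audio = np.random.randn(13)   # e.g. 13 cepstral coefficients for a 10 ms window
visual = np.random.randn(6)   # e.g. lip-corner, lip-centre and jaw measurements
observation = fuse_features(audio, visual)   # 19-dimensional combined vector
```
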
Although the initial results of the IBM research were promising, a studio
is an ideal environment and often far removed from the conditions
experienced in the real world. Many camera-based systems that work
well in the controlled conditions of a laboratory fail when they are tested
in situations where the lighting is uneven or the speaker is facing away
from the camera. One method of combating such problems is to use an
audiovisual headset, with a tiny camera mounted on a boom, enabling
the mouth region to be monitored constantly, independent of any movement
of the head. IBM is also exploring the use of infrared illuminators
for the mouth region to provide a constant level of lighting.
Another solution to the problem of changing video conditions is a
feedback system that changes its confidence levels as it combines audio
and visual features, making its decisions using an evaluation function
similar to those described for game-playing programs,14 on the basis of
the relative weights of the two sources of information. When a speaker
faces away from the camera, the system's confidence in the lip reading
and other visual cues becomes zero: the system simply ignores the
visual information and relies on what it hears. When the visual
information is strong, it is included. The goal of the IBM system is to do better
than when relying on audio or video information alone. At worst, the
system is as good as audio alone. At best, it is much better.
14 See the section “How Computers Evaluate a Chess Position” in Chapter 3.
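
The weighting can be pictured as in the sketch below, a hedged illustration rather than IBM's implementation: each recogniser scores the candidate words separately, and the scores are mixed according to a visual confidence that falls to zero when the face is out of view.

```python
# A hedged sketch of the confidence-weighted decision described above; the
# linear weighting rule is an assumption for illustration, not IBM's formula.
def fuse_scores(audio_scores, visual_scores, visual_confidence):
    """Both score dictionaries map the same candidate words to
    log-probabilities; visual_confidence runs from 0.0 (face lost)
    to 1.0 (mouth clearly in view)."""
    fused = {}
    for word, audio_score in audio_scores.items():
        visual_score = visual_scores[word]
        # With zero visual confidence the video term drops out entirely,
        # so the decision is exactly the audio-only decision.
        fused[word] = ((1.0 - visual_confidence) * audio_score
                       + visual_confidence * visual_score)
    return max(fused, key=fused.get)

# The audio alone slightly favours "bat", but a clear view of the lips
# tips the decision to "pat".
audio = {"bat": -1.2, "pat": -1.4}
visual = {"bat": -2.0, "pat": -0.3}
print(fuse_scores(audio, visual, visual_confidence=0.8))   # -> "pat"
```
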