For the 3D shape extraction of the talking face, we have used a 3D acquisition
system that uses structured light (Eyetronics, 1999). It projects a grid onto the
face, and extracts the 3D shape and texture from a single image. By using a video
camera, a quick succession of 3D snapshots can be gathered. We are especially
interested in frames that represent the different visemes. These are the frames
where the lips reach their extremal positions for the corresponding sound (Ezzat
and Poggio (2000) followed the same approach in 2D). The acquisition system
yields the 3D coordinates of several thousand points for every frame. The output
is a triangulated, textured surface. The problem is that the 3D points correspond
to projected grid intersections, not to fixed physical points on the face.
Hence, the points for which 3D coordinates are given change from frame to
frame. The next steps have to solve for the physical correspondences.
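Selecting the viseme frames from the video sequence amounts to finding the frames where the lip motion reverses direction. The sketch below illustrates one simple way to do this; the per-frame mouth-opening measure and the function name are our own illustrative assumptions, not part of the acquisition system described above.

```python
import numpy as np

def extremal_frames(mouth_opening):
    """Pick candidate viseme frames as local extrema of a per-frame
    mouth-opening signal (a hypothetical measure, e.g. the distance
    between upper- and lower-lip grid points in each 3D snapshot).

    Plateaus (zero difference) are skipped; this is only a sketch of
    the idea, not the authors' actual selection procedure."""
    x = np.asarray(mouth_opening, dtype=float)
    d = np.diff(x)
    # A sign change in the finite difference marks a local extremum.
    idx = np.where(np.sign(d[:-1]) * np.sign(d[1:]) < 0)[0] + 1
    return idx
```

For instance, a signal that rises to a maximum while a vowel is articulated and falls back afterwards yields exactly the frame indices at which the lips reach their extremal positions.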
Fitting of the Generic Head Model
Our animation approach assumes a specific topology for the face mesh. This is
a triangulated surface with 2,268 vertices for the skin, supplemented with
separate meshes for the eyes, teeth, and tongue (another 8,848 vertices, mainly for the
teeth). Figure 3 shows the generic head and its topology.
The first step in this fitting procedure deforms the generic head by a simple
rotation, translation, and anisotropic scaling operation, to crudely align it with the
neutral shape of the example face. This transformation minimizes the average
distance between a number of special points on the example face and the model.
Figure 3. The generic head model that is fitted to the scanned 3D data of
the example face. Left: Shaded version; Right: Underlying mesh.
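The crude alignment step above can be sketched in code. The function below estimates a rotation, anisotropic (per-axis) scale, and translation that minimize the mean squared distance between corresponding landmark points, by alternating a closed-form rotation step (Kabsch, via SVD) with a closed-form per-axis scale step. The alternating scheme and all names are our own illustrative assumptions; the source does not specify how the minimization is solved.

```python
import numpy as np

def align_landmarks(model_pts, target_pts, n_iters=200):
    """Fit y ~= R @ (s * x) + t for (n, 3) landmark arrays, where R is a
    rotation, s an anisotropic per-axis scale, and t a translation.
    Alternates a Kabsch rotation step with a least-squares scale step
    (a sketch; not necessarily the authors' solver)."""
    X = np.asarray(model_pts, dtype=float)   # generic-model landmarks
    Y = np.asarray(target_pts, dtype=float)  # scanned-face landmarks
    # Work with centered point sets; translation is recovered at the end.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    s = np.ones(3)
    for _ in range(n_iters):
        # Rotation step (Kabsch): best R for the currently scaled model.
        Xs = Xc * s
        U, _, Vt = np.linalg.svd(Xs.T @ Yc)
        d = np.sign(np.linalg.det(Vt.T @ U.T))
        R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T  # proper rotation
        # Scale step: with R fixed, each axis scale has a closed form.
        Yr = Yc @ R  # targets rotated back into the model frame
        s = (Xc * Yr).sum(axis=0) / (Xc * Xc).sum(axis=0)
    t = Y.mean(axis=0) - R @ (X.mean(axis=0) * s)
    return R, s, t
```

With noise-free corresponding landmarks the recovered transform reproduces the targets; with real scan data it only minimizes the average distance, which is all this crude first alignment is meant to achieve before the finer fitting steps.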