changes, such as the fast tongue tip movement in “nine”, can not by synthesized
by the appearance model. The reason is that only one image is currently used
to model a phoneme, thus the dynamic details are lost.
In the future, we plan to conduct human perception test using more subjects
so that the experiment results can be more statistically rigorous. We need to
identify people whose visual speeches are highly lip-readable, and derive better
motion models from their data (both geometrical and appearance model). We
also plan to investigate methods using increased appearance samples to model
the dynamic appearance changes.