Survey of 3D Human Body Representations

INTRODUCTION

Human body modeling was first tackled within the computer graphics (CG) community, for applications in the film industry and computer games. Since then, several tools have been developed for editing and animating 3D digital body models. Although most of those tools were initially devised within the computer graphics community, nowadays much of the work comes from the computer vision (CV) community. Despite this overlapping interest, CG and CV applications of human body models (HBMs) differ considerably. CG pursues realistic models of both human body geometry and its associated motion. CV, by contrast, favors efficiency over accuracy, for applications such as intelligent video surveillance, motion analysis, telepresence, and 3D video sequence processing and coding.
This work focuses on vision-based human body modeling systems. It presents some of the techniques proposed in the literature, together with their advantages and disadvantages. The outline is as follows. First, the geometrical primitives and mathematical formalisms used for 3D model representation are addressed. Next, a brief description of the standards used for coding HBMs is given. Finally, future trends and conclusions are presented.

3D HUMAN BODY REPRESENTATIONS

Modeling a human body requires, first, the definition of an articulated 3D structure that represents the biomechanical features of the human body and, second, the choice of an appropriate mathematical model to govern the movements of that articulated structure.
Several 3D articulated representations and mathematical formalisms have been proposed in the literature to model both the structure and the movements of a human body (Green & Guan, 2004). Generally, an HBM is represented as a chain of rigid bodies, called links, interconnected by joints. Links can be represented by means of sticks (Yoo, Nixon & Harris, 2002; Taylor, 2000), polyhedra (Saito & Hoshino, 2001), generalized cylinders (Sidenbladh, Black & Sigal, 2002), or superquadrics (Marzani, Calais & Legrand, 2001). A joint interconnects two links by means of rotational motions about its axes. The number of independent rotation parameters defines the degrees of freedom (DOF) associated with a given joint. Figure 1 presents an illustration of an articulated model defined by 12 links (sticks) and 10 joints. Other HBM representations, which do not follow the links-and-joints philosophy, have also been proposed in the literature to tackle specific applications. For example, Douros, Dekker and Buxton (1999) present a technique to represent HBMs as single entities by means of smooth surfaces or polygonal meshes. This kind of representation is only useful as a rigid description of the human body. In contrast, Plankers and Fua (2003) and Aubel, Boulic, and Thalmann (2000) present a framework that retains an articulated structure represented by sticks but replaces the simple geometric primitives with soft objects. The result of this soft surface representation is a realistic model in which body parts such as the chest, abdomen, or biceps muscles are well modeled.
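The links-and-joints description above can be sketched as a small tree data structure. The following is a minimal illustration only, not a model taken from any of the cited works: the class names, the link lengths, and the toy torso-thigh-shin chain are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Joint:
    """A joint attaching a link to its parent; dof counts its rotation axes."""
    name: str
    dof: int  # number of independent rotation parameters (1 to 3)


@dataclass
class Link:
    """A rigid stick; joint is None for the root link."""
    name: str
    length: float  # illustrative length in metres (assumed value)
    joint: Optional[Joint] = None
    children: List["Link"] = field(default_factory=list)

    def add(self, child: "Link") -> "Link":
        """Attach a child link and return it so chains can be built easily."""
        self.children.append(child)
        return child


def count_links(link: Link) -> int:
    """Number of links in the subtree rooted at link."""
    return 1 + sum(count_links(c) for c in link.children)


def total_dof(link: Link) -> int:
    """Sum of the DOF of every joint in the subtree rooted at link."""
    own = link.joint.dof if link.joint else 0
    return own + sum(total_dof(c) for c in link.children)


# Hypothetical toy model: a torso with one leg (hip: 3 DOF, knee: 1 DOF).
torso = Link("torso", 0.60)
thigh = torso.add(Link("thigh", 0.45, Joint("hip", 3)))
shin = thigh.add(Link("shin", 0.45, Joint("knee", 1)))

print(count_links(torso), total_dof(torso))
```

The same traversal applies unchanged to a full-body model; only the tree built at the bottom would differ.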
The simplest 3D articulated structure is a stick representation with no associated volume or surface (Liebowitz & Carlsson, 2001). Planar 2D representations, such as cardboard models, have also been widely used (Huang & Huang, 2002). However, volumetric representations are preferred when more realistic models need to be generated. In other words, there is a trade-off between accuracy of representation and complexity. The models used should be reasonably realistic, but they should have a low number of parameters so that they can be processed in real time. Table 1 presents a summary of some of the approaches followed in the literature.

Figure 1. Stick representation of an articulated model defined by 22 DOF

Each of the aforementioned geometrical structures is complemented by means of a motion model that governs its movements (Rohr, 1997); the objective is that the full body performs realistic movements. There is a wide variety of ways to mathematically model articulated systems from a kinematics and dynamics point of view. A mathematical model will include the parameters that describe the links as well as information about the constraints associated with each joint. A model that only includes this information is called a kinematics model and describes the possible static states of a system. The state vector of a kinematics model consists of the model state and the model parameters. A system in motion is modeled when the dynamics of the system are modeled as well. A dynamics model describes the state evolution of the system over time. In a dynamics model the state vector includes linear and angular velocities as well as position.
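The distinction between a kinematics model and a dynamics model can be illustrated with a short sketch. It assumes a constant angular velocity over each time step and simple per-joint angle limits; these choices, and all names and numeric values, are illustrative assumptions rather than a motion model from the cited literature.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class KinematicModel:
    """Link parameters plus joint constraints: describes admissible static poses."""
    link_lengths: List[float]
    joint_limits: List[Tuple[float, float]]  # (min, max) angle in radians

    def is_valid(self, angles: List[float]) -> bool:
        """True if every joint angle lies within its limits."""
        return all(lo <= a <= hi for a, (lo, hi) in zip(angles, self.joint_limits))


@dataclass
class DynamicState:
    """State of a system in motion: positions (angles) and angular velocities."""
    angles: List[float]
    velocities: List[float]  # rad/s


def step(model: KinematicModel, state: DynamicState, dt: float) -> DynamicState:
    """Advance the state by dt, assuming constant velocity and clamping to limits."""
    angles = []
    for a, v, (lo, hi) in zip(state.angles, state.velocities, model.joint_limits):
        angles.append(min(max(a + v * dt, lo), hi))
    return DynamicState(angles, list(state.velocities))


# Hypothetical two-joint leg (hip, knee) with assumed limits.
model = KinematicModel([0.45, 0.45], [(-0.5, 2.0), (0.0, 2.5)])
state = DynamicState([0.25, 2.25], [1.0, 1.0])
next_state = step(model, state, 0.5)
print(next_state.angles)
```

Here the kinematic part answers whether a pose is admissible, while the dynamic part describes how the pose evolves over time.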
After selecting an appropriate model for a particular application, it is necessary to develop a concise mathematical formulation for a general solution to the kinematics and dynamics problems, which are non-linear. Different formalisms have been proposed in order to assign local reference frames to the links. The simplest approach is to introduce joint hierarchies formed by independent articulations of one DOF each, described in terms of Euler angles. The body posture is then synthesized by concatenating the transformation matrices associated with the joints, starting from the root. To illustrate this notation, let us express the coordinates of point A in the global reference frame associated with the root of the model (see Figure 1):

where $A_{knee}$ represents the coordinates of point A relative to the local reference frame placed at the knee joint, and $\mathrm{Trans}_i^j$ is the transformation matrix that expresses reference frame i in reference frame j; this matrix is defined as:

C and S represent the cosine and sine, respectively, of the three Euler angles. This kind of matrix concatenation can be used to express every body part in the global reference frame of the body.
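The concatenation of joint transformations can be sketched numerically as follows. This is a minimal illustration under stated assumptions: the rotation is composed as Rz(psi) * Ry(theta) * Rx(phi), which is only one of several Euler-angle conventions and may differ from the one used in the text; the intermediate hip frame, the function names, and all numeric offsets are hypothetical.

```python
import math


def transform(phi, theta, psi, tx, ty, tz):
    """Homogeneous 4x4 matrix with R = Rz(psi) * Ry(theta) * Rx(phi) (assumed
    convention) and translation (tx, ty, tz)."""
    cf, sf = math.cos(phi), math.sin(phi)
    ct, st = math.cos(theta), math.sin(theta)
    cp, sp = math.cos(psi), math.sin(psi)
    return [
        [cp * ct, cp * st * sf - sp * cf, cp * st * cf + sp * sf, tx],
        [sp * ct, sp * st * sf + cp * cf, sp * st * cf - cp * sf, ty],
        [-st, ct * sf, ct * cf, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def apply(m, p):
    """Apply a homogeneous transform to a 3D point."""
    v = [p[0], p[1], p[2], 1.0]
    return tuple(sum(m[i][k] * v[k] for k in range(4)) for i in range(3))


# Hypothetical chain root <- hip <- knee.
# Hip frame expressed in the root frame: rotated 90 degrees about x, offset.
root_from_hip = transform(math.pi / 2, 0.0, 0.0, 0.1, -0.1, 0.0)
# Knee frame expressed in the hip frame: pure translation along the thigh.
hip_from_knee = transform(0.0, 0.0, 0.0, 0.0, -0.45, 0.0)

a_knee = (0.0, -0.2, 0.0)  # point A in the knee frame (assumed value)
a_root = apply(matmul(root_from_hip, hip_from_knee), a_knee)
print(a_root)
```

The matrices are multiplied starting from the root, so adding further joints to the chain only appends factors on the right-hand side of the product.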

3D HUMAN BODY CODING STANDARDS

In order to animate or interchange HBMs, a standard representation is required. Related standards, such as the Web3D H-anim standards, MPEG-4 face and body animation, and the MPEG-4 AFX extensions for humanoid animation, allow compatibility between different HBM processing tools (e.g., HBMs created using one editing tool can be animated using a completely different tool).
The Web3D H-anim working group was formed so that developers could agree on a standard naming convention for human body parts and joints. This group has produced the Humanoid Animation Specification standards, describing a standard way of representing humanoids in VRML. These standards allow humanoids created using authoring tools from one vendor to be animated using tools from another. H-anim humanoids can be animated using keyframing, inverse kinematics, performance animation systems, and other techniques. The three main design goals of H-anim standards are:
• Compatibility: Humanoids should be able to display/animate in any VRML compliant browser.
• Flexibility: No assumptions are made about the types of applications that will use humanoids.
• Simplicity: The specification should contain only what is absolutely necessary.
For this reason, an H-anim file defines a hierarchy of Joint nodes, each of which specifies the rotation center of a joint. The most common implementation of a Joint is a VRML Transform node, which defines the relationship of each body segment to its immediate parent. Each Joint node can contain other Joint nodes and may also contain a Segment node, which holds information about the 3D geometry, color, and texture of the body part associated with that joint. Each Segment can also have a number of Site nodes, which define specific locations relative to the segment. Joint nodes may also contain additional hints for inverse-kinematics systems that wish to control the H-anim figure.
The H-anim Joint and Segment hierarchy is shown in Figure 2.
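The nesting of Joint, Segment, and Site nodes can be mimicked with a few plain data classes. The sketch below is not VRML and does not reproduce the H-anim node fields; the Python classes, the coordinates, and the Site name are hypothetical, and the joint and segment names are only meant to follow the style of the H-anim naming convention.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Site:
    """A named location relative to a segment."""
    name: str
    offset: Vec3


@dataclass
class Segment:
    """The body part attached to a joint (geometry omitted in this sketch)."""
    name: str
    sites: List[Site] = field(default_factory=list)


@dataclass
class Joint:
    """A joint with its rotation center, optional segment, and child joints."""
    name: str
    center: Vec3
    segment: Optional[Segment] = None
    children: List["Joint"] = field(default_factory=list)


def walk(joint: Joint, depth: int = 0) -> Iterator[Tuple[int, str, Optional[str]]]:
    """Depth-first traversal yielding (depth, joint name, segment name)."""
    yield depth, joint.name, joint.segment.name if joint.segment else None
    for child in joint.children:
        yield from walk(child, depth + 1)


# Assumed fragment of a left leg; coordinates and the Site are illustrative.
knee = Joint("l_knee", (0.1, 0.5, 0.0),
             Segment("l_calf", [Site("example_site", (0.0, -0.2, 0.05))]))
hip = Joint("l_hip", (0.1, 0.9, 0.0), Segment("l_thigh"), [knee])
root = Joint("HumanoidRoot", (0.0, 0.95, 0.0), Segment("sacrum"), [hip])

for depth, joint_name, segment_name in walk(root):
    print("  " * depth + f"{joint_name} -> {segment_name}")
```

Because each Joint owns its children, rotating a parent joint implicitly carries every segment below it, which is the behavior the Transform-node implementation provides.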
Furthermore, the MPEG-4 SNHC (Synthetic and Natural Hybrid Coding) group has standardized two types of streams in order to animate avatars:
• The Face/Body Definition Parameters (FDP/BDP) are avatar specific and based on the H-anim specifications.
• The Face/Body Animation Parameters (FAP/BAP) are used to animate face/body models. More specifically, 168 Body Animation Parameters (BAPs) are defined by MPEG-4 SNHC to describe almost any possible body posture. Thus, a single set of FAPs/BAPs can be used to describe the face/body posture of different avatars. MPEG-4 has also standardized the compressed form of the resulting animation stream using two techniques: DCT-based or prediction-based. Typical bitrates for these compressed bitstreams are 2 kbps for facial animation and 10 to 30 kbps for body animation.
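The idea behind prediction-based coding of an animation stream can be shown with a simple frame-differencing sketch. This is not the MPEG-4 coder: the standard's actual bitstream syntax, quantization, and entropy coding are not reproduced here, and the function names and parameter values are hypothetical. The sketch only shows why slowly varying parameters yield small residuals that are cheap to encode.

```python
from typing import List


def encode(frames: List[List[int]]) -> List[List[int]]:
    """Keep the first frame, then store each frame as the difference from the
    previous one (the previous frame acts as the prediction)."""
    if not frames:
        return []
    residuals = [list(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        residuals.append([c - p for c, p in zip(cur, prev)])
    return residuals


def decode(residuals: List[List[int]]) -> List[List[int]]:
    """Rebuild the frames by accumulating the stored differences."""
    if not residuals:
        return []
    frames = [list(residuals[0])]
    for delta in residuals[1:]:
        frames.append([p + d for p, d in zip(frames[-1], delta)])
    return frames


# Three frames of three hypothetical integer-valued animation parameters.
frames = [[100, -40, 0], [102, -41, 0], [105, -41, 1]]
residuals = encode(frames)
print(residuals)
```

After the first frame, the residuals stay close to zero whenever the motion is smooth, which is what a subsequent entropy coder would exploit.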
In addition, complex 3D deformations that result from the movement of specific body parts (e.g., muscle contraction, clothing folds, etc.) can be modeled by using Face/Body Animation Tables (FATs/BATs), which specify sets of vertices that undergo non-rigid motion and a function that describes this motion with respect to the values of specific FAPs/BAPs. However, a significant problem with such tables is that they are body model-dependent and require a complex modeling stage. To solve such problems, MPEG-4 addresses new animation functionalities in the framework of the AFX group by also including a generic seamless virtual model definition and bone-based animation. In particular, the AFX specification describes state-of-the-art components for rendering geometry, textures, volumes, and animation. A hierarchy of geometry, modeling, physics, and biomechanical models is described, along with advanced tools for animating these models (Figure 3).
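The table-based deformation mentioned at the start of the preceding paragraph can be pictured as a lookup that maps an animation parameter value to a vertex displacement. The sketch below assumes a piecewise-linear function with clamping outside the tabulated range; the table layout, the function name, and the numeric entries are illustrative assumptions, not the format defined by MPEG-4.

```python
from bisect import bisect_right
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def displacement(table: List[Tuple[float, Vec3]], value: float) -> Vec3:
    """Piecewise-linear interpolation of a vertex displacement.

    table is a list of (parameter value, displacement) pairs sorted by
    parameter value; values outside the table are clamped to the end entries.
    """
    keys = [k for k, _ in table]
    if value <= keys[0]:
        return table[0][1]
    if value >= keys[-1]:
        return table[-1][1]
    i = bisect_right(keys, value)
    (k0, d0), (k1, d1) = table[i - 1], table[i]
    t = (value - k0) / (k1 - k0)
    return tuple(a + t * (b - a) for a, b in zip(d0, d1))


# Hypothetical table for one skin vertex driven by one animation parameter.
bulge_table = [
    (0.0, (0.0, 0.00, 0.0)),
    (100.0, (0.0, 0.01, 0.0)),
    (200.0, (0.0, 0.03, 0.0)),
]
print(displacement(bulge_table, 150.0))
```

A table of this kind has to be authored per vertex set and per body model, which illustrates the model dependence noted above.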

Table 1. Human body structure representations

Authors	DOF	Geometrical Model Representation
Delamarre and Faugeras (2001)	22	Truncated cones (arms and legs), spheres (neck, joints, and head), and right parallelepipeds (hands, feet, and torso)
Gavrila (1999)	22	Superquadrics
Barron and Kakadiaris (2000)	60	Sticks
Cohen, Medioni and Gu (2001)	32	Generalized cylinders
Ning, Tan, Wang and Hu (2004)	12	Truncated cones (torso, arms, and legs) and a sphere (head)

Specifically, the new Humanoid Animation Framework defined by MPEG-4 SNHC (Preda, 2002; Preda & Preteux, 2001) is specified as a biomechanical model in AFX and is based on a rigid skeleton. The skeleton consists of bones, which are rigid objects that can be transformed (rotated around specific joints) but not deformed. A skin model is attached to the skeleton and smoothly follows any skeleton movement.
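One common way to make a skin follow a rigid skeleton smoothly is linear blend skinning, in which each skin vertex is moved by a weighted average of the bone transformations. The sketch below uses that technique in 2D purely as an illustration; it is not claimed to be the skinning scheme specified by AFX, and the bone layout, weights, and function names are assumptions.

```python
import math
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float]


def rotate_about(point: Point, pivot: Point, angle: float) -> Point:
    """Rotate a 2D point about a pivot by angle (radians, counter-clockwise)."""
    c, s = math.cos(angle), math.sin(angle)
    dx, dy = point[0] - pivot[0], point[1] - pivot[1]
    return (pivot[0] + c * dx - s * dy, pivot[1] + s * dx + c * dy)


def skin(vertex: Point,
         weights: Sequence[float],
         bones: Sequence[Callable[[Point], Point]]) -> Point:
    """Linear blend skinning: weighted average of the vertex as moved by each
    rigid bone transformation. The weights are expected to sum to 1."""
    x = y = 0.0
    for w, bone in zip(weights, bones):
        px, py = bone(vertex)
        x += w * px
        y += w * py
    return (x, y)


# Hypothetical arm: the upper bone stays fixed, the lower bone rotates
# 90 degrees about an elbow placed at (1, 0).
elbow = (1.0, 0.0)
upper = lambda p: p
lower = lambda p: rotate_about(p, elbow, math.pi / 2)

far_vertex = skin((2.0, 0.0), (0.0, 1.0), (upper, lower))   # follows lower bone
near_vertex = skin((1.2, 0.0), (0.5, 0.5), (upper, lower))  # blended at the elbow
print(far_vertex, near_vertex)
```

Each bone remains rigid, as required; only the skin vertices near a joint are blended, which is what produces the smooth transition.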

FUTURE TRENDS AND CONCLUSIONS

Vision-based applications have grown considerably over the last two decades. As a result of that growth, current technology can tackle complex tasks such as human body modeling, although at the moment only under well-defined constraints. In addition, the knowledge collected during this time in different research areas (e.g., video processing, rigid/articulated object modeling, human body/motion models, etc.) also helps in addressing vision-based human body modeling. However, despite this large amount of work, many issues remain open. Problems such as the development of models that include prior knowledge, the modeling of multiple-person environments, and real-time performance still need to be solved efficiently.
Figure 2. The H-anim 1.1 Joint and Segment hierarchy (from H-anim Website). Three sets of joints are identified, classified according to their significance, so that H-anim models of different complexity can be produced. Segments are shown with dark grey color and Sites with light grey color. Each object beginning with l_ (left) has a corresponding object beginning with r_ (right). The chart was produced by J. Eric Mason and Veronica Polo, VR Telecom Inc.

Figure 3. Hierarchy of AFX animation models

In addition to the aforementioned issues, reducing the processing time is one of the main goals in the non-rigid object modeling field. Processing time depends strongly on two factors: computational complexity and the available technology. Judging by the evolution of the past few years, computational complexity will not be significantly reduced in the near future. In contrast, improvements in technology have become commonplace (e.g., reduction in acquisition and processing times, increase in memory size). Therefore, algorithms that are computationally prohibitive today are expected to perform well with future technologies. This gives rise to a promising future for HBM applications and, by extension, for non-rigid object modeling in general.

KEY TERMS

Articulated Object: Structure composed of two or more rigid bodies interconnected by means of joints. The degrees of freedom associated with each joint define the different structure configurations.
H-Anim: VRML Consortium Charter for Humanoid Animation Working Group. This group has recently produced the International Standard, “Information technology—computer graphics and image processing—humanoid animation (H-anim),” an abstract representation for modeling three-dimensional human figures.
Human Body Modeling: Digital model generally describing the shape and motion of a human body.
MPEG: Moving Picture Experts Group; a group developing standards for coding digital audio and video, as used for example in video CD, DVD, and digital television. This term is often used to refer to media that is stored in the MPEG-1 format.
MPEG-2: A standard formulated by the ISO Moving Picture Experts Group (MPEG), a subset of ISO Recommendation 13818, meant for transmission of studio-quality audio and video. It covers four levels of video resolution.
MPEG-4: A standard formulated by the ISO Moving Picture Experts Group (MPEG), originally concerned with applications similar to those of H.263 (very low bit rate channels, up to 64 kbps). It was subsequently extended to encompass a large set of multimedia applications, including applications over the Internet.
MPEG-4 AFX: MPEG-4 extension with the aim to define high-level components and a framework to describe realistic animations and 2D/3D objects.
MPEG-7: A standard formulated by the ISO Moving Picture Experts Group (MPEG). Unlike MPEG-2 and MPEG-4, which deal with compressing multimedia contents within specific applications, it specifies the structure and features of the compressed multimedia content produced by the different standards, for instance to be used in search engines.
Rotation Matrix: A linear operator rotating a vector in a given space. A rotation matrix has only three degrees of freedom in 3D and one in 2D. It can be parameterized in various ways, usually through Euler angles, yaw-pitch-roll angles, rotation angles around the coordinate axes, and so forth.
Virtual Reality: 3D digital world, simulating the real one, allowing a user to interact with objects as if inside it.
VRML: Virtual Reality Modeling Language, a platform-independent language for virtual reality scene description.