Introduction: The Role of Midlevel Surface Representation in 3D Object Encoding (Computer Vision) Part 2

Surface Representation as a Riemannian Fiber Bundle

Jun Zhang has made the interesting proposal that visual perception can be viewed as an interpretation based on the intrinsic geometry determined by rules of organization of the sensory data (Zhang and Wu, 1990; Zhang, 2005). The general idea is to relate perceptual unity to the concept of intrinsic constancy under a non-Euclidean geometry, which may be extended to visual modalities such as form, motion, color, and depth. The perceptual structure of the visual process can then be described as a fiber bundle, with visual space as the base manifold, the mapping from the world to the cortex as the base connection, the motion system as the tangent fiber, and all other relevant visual modalities as general fibers within the fiber bundle. The cross section of the fiber bundle is the information from the visual scene, an intrinsically invariant (parallel) portion of which represents a visual object. This concept can account for the unity of perceptual binding of the variety of different perceptual cues that are segregated early in the visual process.

Multiple Surface Cues

Studies of surface properties typically focus on surfaces represented by purely stereoscopic cues, but physical surfaces are almost always defined by multiple visual cues. Thus, it is important to treat the multimodal representation of surfaces as a perceptual primary, integrating the properties of the reflectance structures in the world into a unified surface representation:


d(x,y) \;=\; f\bigl(d_S(x,y),\, d_D(x,y),\, d_M(x,y),\, d_T(x,y),\, \ldots\bigr) \qquad (0.9)

where d_X(x,y) are the independent egocentric distances computed from each of the independent distance cues (S, luminance shading; D, binocular disparity; M, motion parallax; T, texture gradient, etc.), and f(·) is the operative cue combination rule.

This expression says that the information from these diverse modalities is combined into a unitary depth percept, but it does not specify the combination rule by which they are aggregated. For the commonly proposed Bayesian combination rule,

d(x,y) \;=\; \frac{\sum_X d_X(x,y)/\sigma_X^2}{\sum_X 1/\sigma_X^2} \qquad (0.10)

where \sigma_X^2 are the noise variances for each distance cue.

If surface reconstruction is performed separately in each visual modality (with independent noise sources), the surface distance estimates should combine according to their absolute signal-noise ratios. Signals from the various modalities (S, D, M, T, …, X) would combine to improve the surface distance estimation; adding information about the object profile from a second surface identification modality could never degrade surface reconstruction accuracy.
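This inverse-variance weighting can be sketched in a few lines of Python (a toy numerical illustration only; the cue values and variances are invented for the example, not drawn from data):

```python
import numpy as np

def combine_cues(estimates, variances):
    """Fuse independent distance-cue estimates d_X by inverse-variance
    (Bayesian) weighting, as in Equation (0.10).

    estimates : per-cue distance estimates at one location (x, y)
    variances : per-cue noise variances sigma_X^2
    Returns the fused estimate and its (reduced) variance.
    """
    w = 1.0 / np.asarray(variances, dtype=float)   # inverse-variance weights
    d = np.asarray(estimates, dtype=float)
    fused = (w * d).sum() / w.sum()                # weighted mean
    fused_var = 1.0 / w.sum()                      # never exceeds min(variances)
    return fused, fused_var

# Disparity (D) and texture (T) estimates at one location:
d, v = combine_cues([2.0, 2.4], [0.1, 0.4])
# Adding a motion-parallax (M) cue can only shrink the fused variance,
# illustrating that a second modality never degrades the reconstruction:
d3, v3 = combine_cues([2.0, 2.4, 2.1], [0.1, 0.4, 0.2])
assert v3 < v
```

Because the fused variance is 1/Σ(1/σ²_X), every added cue strictly decreases it, which is the formal counterpart of the claim that a second surface-identification modality can never hurt.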

More realistic, post-Bayesian versions of the combination rule have also been proposed.

The main rule by which a region is assigned to a particular border is the occlusion rule. If a disparity-defined structure encloses a region, that region is seen as occluding the region outside the border. The border is perceived as “owning” the region inside the enclosure, which is therefore assigned to the same depth as the border. The region outside the enclosure is more distant and is “owned” by the next depth-defined border reached beyond the first. These processes must somehow be implemented in the neural representation of the surfaces that we perceive on viewing such images.
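The occlusion rule can be sketched as a toy grid computation (an illustrative assumption, not a model of the neural implementation): flood-fill the background from the image edge to find the enclosed region, then let the border "own" that interior at its own depth while the exterior is assigned to the more distant surface:

```python
import numpy as np

def inside_enclosure(border):
    """Flood-fill the background from the image edge; cells that are
    neither border nor reachable background lie inside the enclosure."""
    h, w = border.shape
    outside = np.zeros_like(border, dtype=bool)
    stack = [(r, c) for r in range(h) for c in range(w)
             if (r in (0, h - 1) or c in (0, w - 1)) and not border[r, c]]
    for r, c in stack:
        outside[r, c] = True
    while stack:
        r, c = stack.pop()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < h and 0 <= nc < w
                    and not border[nr, nc] and not outside[nr, nc]):
                outside[nr, nc] = True
                stack.append((nr, nc))
    return ~border & ~outside

def apply_occlusion_rule(border, border_depth, far_depth):
    """The border 'owns' its enclosed region: the interior inherits the
    border's depth; the exterior belongs to the more distant surface."""
    inside = inside_enclosure(border)
    return np.where(inside | border, border_depth, far_depth)

# A closed square border at depth 1.0 over a more distant ground at 2.0:
border = np.zeros((7, 7), dtype=bool)
border[1, 1:6] = border[5, 1:6] = border[1:6, 1] = border[1:6, 5] = True
depth = apply_occlusion_rule(border, border_depth=1.0, far_depth=2.0)
assert depth[3, 3] == 1.0 and depth[0, 0] == 2.0
```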

In terms of Equation (0.10), the computational issue that needs to be faced is that the cue combination operation works only if each distance cue exists at every point (x,y). However, the only distance cue that is continuously represented across the visual field is that of luminance shading. All the other cues depend for the computation of distance on the existence of local contrast, which in general is sparsely represented across the field. Computationally, therefore, the process would need to incorporate the ability to operate when some subset of the cues, or all the cues, have sparse values in a subset (x,y) of directions in space, as in Equations (0.5) and (0.6). Moreover, there needs to be some mechanism of integrated interpolation in the process of the sparse cue combination.

Surface Representation Through the Attention Shroud

One corollary of this surface reconstruction approach is a postulate that the object array is represented strictly in terms of its surfaces, as proposed by Nakayama and Shimojo (1990). Numerous studies point to a key role of surfaces in organizing the perceptual inputs into a coherent representation. Norman and Todd (1998), for example, show that depth discrimination is greatly improved if the two locations to be discriminated lie in a surface rather than being presented in empty space. This result is suggestive of a surface level of interpretation, although it may simply be relying on the fact that the presence of the surface provides more information about the depth regions to be assessed. Nakayama, Shimojo, and Silverman (1989) provide many demonstrations of the importance of surfaces in perceptual organization. Recognition of objects (such as faces) is much enhanced where the scene interpretation allows them to form parts of a continuous surface rather than isolated pieces, even when the retinal information about the objects is identical in the two cases.

A more vivid representation of the reconstruction process is to envisage it as an attentional shroud, wrapping the dense locus of activated disparity detectors as a cloth wraps a structured object (see Figure 0.6). The concept of the attentional shroud is intended to capture the idea of a mechanism that acts like the soap film of Equation (0.7) in minimizing the curvature of the perceived depth surface consistent with the available disparity information. Concepts of “mirror neurons” imply that there are neural representations of the actions of others implemented in the brain in a form that is both manipulable and adaptive to new situations faced by the observer (Rizzolatti et al., 1996; Rizzolatti and Sinigaglia, 2010). The concept of the attentional shroud shares some similarities with the mirror concept, in that it mirrors the configuration of the surface in the world with an internal surface representation that can match the configurational properties of the surfaces being viewed.
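The soap-film behaviour of the shroud can be sketched as membrane interpolation: iteratively relax each unconstrained cell toward the mean of its neighbours while clamping the sparse disparity samples (a minimal numerical illustration; the grid, sample values, and Jacobi-style relaxation scheme are assumptions of this sketch, not part of the original proposal):

```python
import numpy as np

def shroud_surface(depth_samples, n_iter=2000):
    """Membrane ('soap film') interpolation: repeatedly replace each free
    cell by the mean of its four neighbours, holding the sparse disparity
    samples fixed. The fixed points act like the depth-cue features to
    which the attentional shroud is drawn.

    depth_samples : 2D array, NaN where no disparity measurement exists.
    """
    known = ~np.isnan(depth_samples)
    z = np.where(known, depth_samples, np.nanmean(depth_samples))
    for _ in range(n_iter):
        padded = np.pad(z, 1, mode="edge")          # replicate the borders
        neigh = (padded[:-2, 1:-1] + padded[2:, 1:-1]
                 + padded[1:-1, :-2] + padded[1:-1, 2:]) / 4.0
        z = np.where(known, depth_samples, neigh)   # clamp the data points
    return z

# Sparse disparity samples on a 5x5 field: far corners, one near feature.
samples = np.full((5, 5), np.nan)
samples[0, 0] = samples[0, 4] = samples[4, 0] = samples[4, 4] = 1.0
samples[2, 2] = 2.0
surface = shroud_surface(samples)   # smooth surface through the samples
```

The relaxed surface passes exactly through the clamped samples and bends as little as possible in between, which is the discrete analogue of the curvature-minimizing film.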


FIGURE 0.6 Depiction of the idea of an attentional shroud wrapping an object, here a camera. The information in the configuration of the shroud conveys the concept of the object shape in a coarse surface representation. The attentional shroud is conceived as a self-organizing manifold drawn to features of the object shape defined by depth cue representations somewhere in the cortex.

Empirical Evidence For Surface Representation in the Brain

Surface representations are often discussed in terms of brightness propagation and texture segmentation, but these are weak inferences toward a true surface representation. Evidence is building from perceptual, psychophysical, neurophysiological, and computational sources in support of a surface-level description operating in the brain. Surface-specific neural coding has been reported early in the visual processing stream (Nienborg et al., 2005; Bredfeldt and Cumming, 2006; Samonds, Potetz, and Lee, 2009) and subsequently appears to be a feature of both the temporal-lobe and parietal-lobe streams of spatial representation in the cortex. In the temporal-lobe stream, neurons responsive to stereoscopic surface orientation have been reported in visual area V4 (Hinkle and Connor, 2002), in the middle temporal area (MT) (Nguyenkim and DeAngelis, 2003), and in the medial superior temporal area (MST) (Sugihara et al., 2002). Deeper into the temporal lobe, many neurons in the inferior bank of the superior temporal sulcus are selective for the complex shape of stereoscopic surfaces (Sakata et al., 1999; Janssen et al., 2001; Tanaka et al., 2001; Liu, Vogels, and Orban, 2004). Moreover, Joly, Vanduffel, and Orban (2009) observed depth-structure sensitivity from disparity in a small region of macaque inferior temporal cortex, TEs, known to house higher-order disparity-selective neurons. Even in the frontal lobe, within ventral premotor cortex, area F5a (the most rostral sector of F5) shows sensitivity for depth structure from disparity. Within this area, 2D shape sensitivity was also observed, suggesting that area F5a processes complete 3D shape and might thus reflect the activity of canonical neurons.

Similarly, several regions of the parietal cortex are involved in the coding of surface shape. At a simple level, a large proportion of neurons in the occipital extension of the intraparietal sulcus of monkey are selective for stereoscopic surface orientation in the third dimension (Shikata et al., 1996; Taira et al., 2000; Tsutsui et al., 2001, 2002). This wealth of studies makes it clear that multimodal surface representation is an important component of the neural hierarchy in both the ventral and dorsal processing streams. Furthermore, Durand et al. (2007) used functional magnetic resonance imaging (fMRI) in monkeys to show that while several intraparietal (IP) areas (caudal, lateral, and anterior IP areas CIP, LIP, and AIP on the lateral bank; posterior and medial IP areas PIP and MIP on the medial bank) are activated by stereoscopic stimuli, AIP and an adjoining portion of LIP are sensitive to the stereoscopic surface shape of small objects. Interestingly, unlike the known representation of 3D shape in macaque inferior temporal cortex, the neural representation in AIP appears to emphasize object parameters required for the planning of grasping movements (Srivastava et al., 2009). This interpretation provides the basis for an understanding of the dual coding of surface shape in both the ventral and dorsal streams.

The dorsal stream would be involved in the 3D properties for the preparation for action, and the ventral stream would be specialized for the processes of semantic encoding and categorization of the objects.

They argued that this region was the first level for the encoding of the generic surface structure of visual objects. Extending along the intraparietal sulcus (IPS), Durand et al. (2009) determined that retinotopic area V7 had a mixed sensitivity to both position in depth and generic depth structure, whereas the dorsal medial (DIPSM) and dorsal anterior (DIPSA) regions of the IPS were sensitive to depth structure but not to position in depth. All three regions were also sensitive to 2D shape, indicating that they carry full 3D shape information. Similarly, Georgieva et al. (2009) report the involvement of five IPS regions as well as the dorsal LOC, the posterior inferior temporal gyrus (ITG), and ventral premotor cortex in the extraction and processing of 3D shape from the depth surface structure of objects in the world.

Conclusion

This overview emphasizes the view of spatial vision as an active process of object representation, in which a self-organizing net of neural representation can reach through the array of local depth cues to form an integrated surface representation of the object structure in the physical world being viewed. Such a description is compatible with a realization in the neural networks of the parieto-occipital cortex rather than just an abstract cognitive schema. This conceptualization identifies an active neural coding process that goes far beyond the atomistic concept of local contour or disparity detectors across the field and that can account for some of the dynamic processes of our visual experience of the surface structure of the scene before us. Once the 3D surface structure is encoded, the elements of the scene can be segmented into the functional units that we know as “objects.”

The contributions to this topic develop advanced theoretical and empirical approaches to all levels of the surface representation problem, from both the computational and neural implementation perspectives. These cutting-edge contributions run the gamut from the basic issue of the ground plane for surface estimation, through midlevel analyses of the processes of surface segmentation, to complex Riemannian space methods of representing and evaluating surfaces. Taken together, they represent a radical new approach to the thorny problem of determining the structure and interrelationships of the objects in the visual scene, one that holds the promise of a decisive advance both in our capability to parse scene information computationally and in our understanding of the coding of such information in the neuronal circuitry of the brain.
