Visual Surface Encoding: A Neuroanalytic Approach (Computer Vision) Part 2

Cortical organization of the surface representation

A striking aspect of the cortical representation of depth structure is provided by the results of a functional magnetic resonance imaging (fMRI) study of the cortical responses in the human brain to disparity structure. An example of the activation to static bars of disparity (presented in a dynamic noise field, with a flat disparity plane in the same dynamic noise as the null stimulus) is shown in Figure 9.7. The only patches of coherent activation (at the required statistical criterion level) are in the dorsal retinotopic areas V3A and V3B, as well as in the lateral cortex posterior to V5, in a cortical region identified as KO by the standard kinetic border localizer (Van Oostende et al., 1997). Our stimuli had no kinetic (motion) component at all, only static depth structure. Retinotopic analysis reveals that the dominant signal occurs at a strongly foveal eccentricity of only 0.5°.

FIGURE 9.7 Functional magnetic resonance imaging (fMRI) flat maps of the posterior pole of the two hemispheres showing the synchronized response to stereoscopic structure (yellowish phases) localized to V3A/B (yellow outlines) and KO (cyan outlines).

The study of Figure 9.7 shows that one area (KO) of dorsolateral occipital cortex stands out as responding to depth structure conveyed by pure disparity cues. This result does not, however, resolve whether this cortical region processes depth structure in general or the more specific subtype of stereoscopic depth structure in particular. To do so, we need a paradigm that presents the same depth structure via different depth cues. This was implemented with stimuli depicting a Gaussian bump defined by stereoscopic, motion, texture, or shading cues in one (test) hemifield. The subjects’ task was to adjust the strength of each cue to match the depth perceived in the stereoscopic version of the stimulus in the other (comparison) hemifield. Figure 9.8 shows the fMRI response in the KO region for the comparison hemifield, which was always stereoscopic, and for the test hemifield, in which the depth cue varied across conditions. Each bar represents the activation for the contrast between full- and quarter-strength versions of the depth cue specified in the legend. The perceived depths for the four cues in the test hemifield were equated for the full-strength stimuli during each scan by the depth-matching task against the stereo-defined bump in the comparison hemifield.

FIGURE 9.8 (a) An example of the stimulus pairs. Gaussian bumps defined by shading (presented above and below the horizontal meridian in the left/test hemifield) and by disparity (in the right/comparison hemifield). (b) Evidence for a generic depth map in the dorsolateral occipital cortex (average of six brains). Test hemifield: Mean group cortical response to four depth cues (see key) at a dorsolateral occipital cortical location that is a candidate for the generic depth map. Note the similarity of response amplitudes for the four individual depth cues (multicolored upward bars), and no significant response in the modality alternation experiment (yellow downward bar), where disparity and shading cues are counterposed (St – Sh). Comparison hemifield: Disparity response under each condition (except the last, where the disparity was held constant as a control for the modality alternation condition).

The final condition shown in Figure 9.8 was a modality alternation test. In this condition, the disparity and shading cues, equated for perceived depth, were alternated with each other. Because perceived depth does not change across this alternation, no response is expected in an area encoding perceived depth. This contrast indeed cancels the response in both hemifields (yellow bars; Figure 9.8). The slightly negative response in the purely stereoscopic hemifield may be understood as adaptation to the intense stereoscopic stimulation. This double dissociation pattern, with comparable responses to all four cues but no response to their alternation, was not seen in any other area of cortex.
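The logic of the full- versus quarter-strength contrasts and of the alternation control can be made concrete with a toy model. This is a purely illustrative sketch. The two unit functions are hypothetical. It also assumes that perceived depth is equal for every cue at full strength, as the depth-matching procedure aims to ensure, and that perceived depth scales linearly with cue strength.

```python
CUES = ("disparity", "motion", "texture", "shading")

def generic_depth_unit(cue, perceived_depth):
    """Hypothetical generic depth map: responds to perceived depth whatever the cue."""
    return perceived_depth

def disparity_unit(cue, perceived_depth):
    """Hypothetical cue-specific unit: responds only when depth is carried by disparity."""
    return perceived_depth if cue == "disparity" else 0.0

def strength_contrast(unit, cue, full=1.0, quarter=0.25):
    """Block contrast between full- and quarter-strength versions of one cue
    (assumes perceived depth scales linearly with cue strength)."""
    return unit(cue, full) - unit(cue, quarter)

def alternation_contrast(unit, cue_a, cue_b, depth=1.0):
    """Block contrast between two cues that have been equated for perceived depth."""
    return unit(cue_a, depth) - unit(cue_b, depth)

for name, unit in (("generic", generic_depth_unit), ("disparity-only", disparity_unit)):
    per_cue = [strength_contrast(unit, c) for c in CUES]
    alt = alternation_contrast(unit, "disparity", "shading")
    print(f"{name:15s} cue contrasts={per_cue} alternation={alt}")
```

Only the generic unit reproduces both halves of the pattern described for KO: equal contrasts for all four cues and a null alternation response. The disparity-only unit responds to a single cue and is strongly driven by the alternation.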

The surface correspondence problem

There has recently been substantial interest in the mechanism by which the motion of a surface defined purely by disparity cues is perceived (Patterson, 1999, 2002; Lu and Sperling, 1995, 2001). When a stereoscopic surface changes over time, there is a fundamental ambiguity about how the surface at one time has transformed into the surface at a later time. Did it move laterally, did it move in depth, or did it move in some combination of the two? We term this the surface correspondence problem. It is a global 4D (i.e., 3D space plus time) generalization of the local correspondence problem (see Figure 9.5): what principle defines which point on the later surface corresponds to a given point on the surface at the earlier time? Much of recent computational neuroscience has been driven by two correspondence problems. The first is the binocular correspondence problem highlighted by Julesz (1971). The second is its temporal counterpart, the motion correspondence problem delimited by the “Braddick limit” (Braddick, 1974). The temporal case generalizes to the aperture problem of local correspondence over time in motion analysis (Marr, 1982), which is solved by the intersection-of-constraints rule, an instantiation of the rigidity constraint. To these seminal problems we now add the surface correspondence problem, a meld of the stereo and motion correspondence problems that operates at the next level of the processing hierarchy.
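The intersection-of-constraints rule mentioned above reduces to a small linear solve. Each 1D edge seen through an aperture constrains only the velocity component along its normal, v · n_i = s_i. Two edges with different orientations therefore fix the 2D velocity. Below is a minimal sketch. The function name and the example edge orientations are illustrative choices, not taken from the text.

```python
import math

def intersection_of_constraints(n1, s1, n2, s2):
    """Solve v . n1 = s1 and v . n2 = s2 for the 2D velocity v = (vx, vy).

    n1, n2: unit normals of two moving edges; s1, s2: their measured normal speeds.
    """
    a, b = n1
    c, d = n2
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("parallel edges: the aperture problem is not resolved")
    # Cramer's rule for the 2x2 system [[a, b], [c, d]] @ v = [s1, s2]
    vx = (s1 * d - b * s2) / det
    vy = (a * s2 - c * s1) / det
    return vx, vy

# Example: a pattern translating rightward at unit speed, seen through two apertures
# containing edges oriented at +45 and -45 degrees.
true_v = (1.0, 0.0)
normals = [(math.cos(a), math.sin(a)) for a in (math.pi / 4, -math.pi / 4)]
speeds = [true_v[0] * nx + true_v[1] * ny for nx, ny in normals]
print(intersection_of_constraints(normals[0], speeds[0], normals[1], speeds[1]))
```

On its own, each edge signals only a normal speed of about 0.707 along its own normal. Only the intersection of the two constraint lines recovers the true rightward velocity.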

Thus, the surface correspondence problem is logically independent of, but interacts with, the two lower-level correspondence problems to achieve a global 4D solution. We expect the identification of this problem to stimulate new investigations of the constraints that govern its global solution, that is, the perceived dynamics of 3D surface transformations. For sinusoidal cyclopean surfaces in particular, two interpretations are possible. One is the rigid lateral (x-axis) motion of a stereoscopically defined surface (Figure 9.9a). The other is the nonrigid set of local motions that the same surface replacement would produce if correspondence matching were defined by motion orthogonal to the surface structure (Figure 9.9b). Our preliminary data suggest that when a stereoscopic sinusoid is alternated between two nearby phases, it is perceived as moving laterally in a globally rigid solution (Figure 9.9a). This holds even though the nearest neighbor rule would imply a nonrigid shape change (Figure 9.9b). We implement the nearest neighbor rule as a mutual proximity constraint in terms of the tangent chord distance of the minimal spheres touching the two surfaces. In other words, the lateral-motion interpretation carries a higher probabilistic “weight.” Further options are considered below.
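The ambiguity between Figures 9.9a and 9.9b can be checked numerically on a cross-section of the surface. The sketch below assumes a sinusoid z = A sin(2πx/λ) that is laterally shifted by d between t1 and t2. As a simplified stand-in for the mutual-proximity (tangent-chord) constraint described above, it uses plain Euclidean nearest neighbor in the (x, z) plane. All names and parameter values are hypothetical. A rigid lateral match would give the same displacement (d, 0) everywhere. Nearest-neighbor matching instead gives displacements that vary along the surface and shrink to zero where the two phases intersect.

```python
import math

def depth(x, amplitude, wavelength, shift=0.0):
    """Depth z of a sinusoidal cyclopean surface, laterally shifted by `shift`."""
    return amplitude * math.sin(2 * math.pi * (x - shift) / wavelength)

def nearest_neighbor_displacements(xs, amplitude, wavelength, shift,
                                   lo=-4.0, hi=8.0, step=1e-3):
    """Match each t1 point to the closest densely sampled t2 point in the (x, z) plane."""
    n = int(round((hi - lo) / step))
    t2 = [(lo + i * step, depth(lo + i * step, amplitude, wavelength, shift))
          for i in range(n + 1)]
    displacements = []
    for x in xs:
        z = depth(x, amplitude, wavelength)
        bx, bz = min(t2, key=lambda p: (p[0] - x) ** 2 + (p[1] - z) ** 2)
        displacements.append((bx - x, bz - z))
    return displacements

A, WAVELENGTH, D = 1.0, 4.0, 0.5  # with these values the two phases cross at x = 1.25 and x = 3.25
xs = [0.0, 0.5, 1.0, 1.25, 1.5, 2.0, 3.0]
for x, (dx, dz) in zip(xs, nearest_neighbor_displacements(xs, A, WAVELENGTH, D)):
    print(f"x={x:4.2f}: nearest-neighbor=({dx:+.3f}, {dz:+.3f})  rigid=({D:+.3f}, +0.000)")
```

Every nearest-neighbor displacement is at most d, because the rigidly shifted point is always one of the candidates. The percept of uniform lateral motion therefore requires reassigning these matches, as discussed next.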

FIGURE 9.9 (a) Diagram representing two sequential phases (full and dashed curves) of a stereoscopic sinusoidal surface (schematized as a cross-section in z,x space, i.e., a top view). Arrows show some corresponding locations, as required for the observed percept of lateral motion while the surface waveform alternates between the two phases. (b) A proximity constraint cannot explain the observed percept. (c) A surface orientation (slant) constraint would provide the requisite matches to account for the percept in (a). (d) Alternating the sinusoid between near and far z-axis positions produces a percept of z-axis stereomotion.

This novel surface correspondence problem should be clearly distinguished from the token correspondence problem that arises when surfaces are constructed from motion cues (Grimson, 1982; Green, 1986; He and Nakayama, 1994). Such studies are concerned with the correspondence between the local luminance elements making up the motion. That motion then defines the surface structure by means of differential motion cues, the classic structure-from-motion paradigm. In the present case, the surface is defined stereoscopically and then alternates between two depth configurations. The question is, what defines the correspondence between the surface locations at t1 and t2? This question is independent of the elements defining the surface (such as the dots shown in Figure 9.6a). The elements can completely change between t1 and t2 without disrupting the surface correspondence. For example, if the surface configuration does not change, no surface motion will be seen even if the dynamic noise elements completely change their location and character between t1 and t2.

In terms of psychophysics, the surface correspondence problem is quantitative rather than qualitative. There is bound to be a balance point in the jump size at which the perceived motion of the stereoscopic surface shifts from lateral (x,y) motion to depth (z) motion, or stereomotion. The interpretation of this transition, however, is that local correspondence naturally leads to a percept of stereomotion in depth, because the local disparity is changing. To see lateral motion of the cyclopean surface, the system must be performing a surface-based correspondence match, as illustrated in Figure 9.9c, for a surface jumping between a t1 configuration and a t2 configuration.

All the closest points in any Euclidean metric in 3D space (regardless of the relative horizontal/vertical scaling) must be reassigned in order to generate a uniform lateral motion. This reassignment may be regarded as a signature of some global process operating to generate the lateral motion percept of the depth surface. For example, a surface-based match would be selective for surface slant. It could therefore disambiguate the matches by pairing only neighbors of the same surface slant (Figure 9.9c). A surface-based match would thus enforce lateral cyclopean motion, because the nearest patch of surface with the same slant is always in the pure lateral direction (for a purely lateral displacement of the computed surface). In particular, the surface slant constraint would rule out stable matches at any intersection point between the surfaces, because the surfaces have opposite slants at the intersection points.
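The slant constraint can be sketched on the same kind of cross-section, under several illustrative assumptions. Slant is approximated by the local slope dz/dx of the profile. For each t1 point, only t2 positions with exactly the same slope are admitted as candidates before the nearest one is chosen. All names and parameter values are hypothetical. At an intersection point the two phases have opposite slopes, so the zero-distance match that proximity favors is excluded, and the lateral match (d, 0) is selected instead. One limitation follows from matching slope alone in 1D. A mirror-phase patch with the same slope but depth of opposite sign can lie slightly closer in a narrow band around each zero crossing of the profile. For that reason, the check below uses points away from those crossings.

```python
import math

def slope(x, amplitude, wavelength, shift=0.0):
    """Local slope dz/dx of the sinusoidal profile (used here as the slant measure)."""
    k = 2 * math.pi / wavelength
    return amplitude * k * math.cos(k * (x - shift))

def slant_matched_candidates(x, wavelength, shift, n_periods=2):
    """t2 positions x2 whose slope equals the t1 slope at x.

    Equal slope means cos(k(x2 - shift)) = cos(kx), i.e. x2 = shift +/- x + n * wavelength.
    """
    cands = []
    for n in range(-n_periods, n_periods + 1):
        cands.append(x + shift + n * wavelength)  # same phase: lateral copy of the patch
        cands.append(shift - x + n * wavelength)  # mirror phase: same slope, depth of opposite sign
    return cands

def slant_constrained_match(x, amplitude, wavelength, shift):
    """Nearest (x, z) match among equal-slope candidates; returns the displacement (dx, dz)."""
    k = 2 * math.pi / wavelength
    z1 = amplitude * math.sin(k * x)
    best = None
    for x2 in slant_matched_candidates(x, wavelength, shift):
        z2 = amplitude * math.sin(k * (x2 - shift))
        dist2 = (x2 - x) ** 2 + (z2 - z1) ** 2
        if best is None or dist2 < best[0]:
            best = (dist2, x2 - x, z2 - z1)
    return best[1], best[2]

A, WAVELENGTH, D = 1.0, 4.0, 0.5
x_cross = 1.25  # the t1 and t2 profiles intersect here
print("slopes at the intersection:",
      slope(x_cross, A, WAVELENGTH), slope(x_cross, A, WAVELENGTH, D))
for x in (0.5, 1.0, 1.25, 1.5, 2.5, 3.0):
    dx, dz = slant_constrained_match(x, A, WAVELENGTH, D)
    print(f"x={x:4.2f}: slant-constrained displacement=({dx:+.3f}, {dz:+.3f})")
```

Plain proximity matches the point at x = 1.25 to itself. With the slope filter, the same point instead moves by (d, 0), like its neighbors.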

The evidence reviewed in this overview points toward the key role of the surface representation in providing the “glue” or “shrink-wrap” to link the object components in their appropriate relationships. It also emphasizes the inherent three-dimensionality of this surface shrink-wrap.

The fact that the replacement of sinusoidal disparity structure is perceived as motion of that structure (see Figure 9.9a) raises two questions. Is a stereoscopic structure moving in 3D space processed by the same cortical mechanism as static disparity structure? And is the z-axis motion of such structure processed by the same mechanism as its y-axis motion? To resolve these questions, we needed a paradigm that dissociates responses to a 3D structure when it is static, moving laterally (y-axis motion), or moving in depth (z-axis motion). If all three stimuli activated the same cortical site, the result would be inconclusive, because such a site could support a variety of inferences about the neural substrate for the depth structure. But if any two of the stimuli activate different cortical sites, that provides strong evidence of a difference in the underlying neural mechanisms.

FIGURE 9.10 Occipital flat map for the left hemisphere of one subject, showing distinct locations of significant activation (yellowish patches). Full-colored outlines show retinotopic areas as in Figure 9.7. Dark blue outline: boundary of hMT+ defined by a motion localizer. Dashed outlines are for comparison of clusters of activated voxels across the three conditions. (a) Stereoscopic structure of a static sinusoidal disparity versus a flat disparity plane activates a region in the dorsolateral cortex. (b) Frontoparallel (y-axis) stereomotion of the sinusoidal stereoscopic surface contrasted with a flat plane activates a swath of cortex including hMT+ (green arrow). It also activates two sites seen in the other two conditions: the depth-structure region similar to (a) (white arrow) and the ventral site seen in (c) (cyan arrow). (c) Z-axis stereomotion versus y-axis stereomotion of the same stereoscopic sinusoidal surface activates regions anterior (yellow arrow; cyclopean stereomotion area CSM) and ventral (cyan arrow) to hMT+.

In fact, robust differential activations are observed in lateral cortex, as illustrated in Figure 9.10. We interpret these activations as showing three things. First, both y-axis and z-axis stereomotion are processed separately from depth structure per se. Second, frontoparallel stereomotion is processed similarly to lateral luminance-based motion in hMT+ but also activates a second, ventral site. Third, z-axis stereomotion is encoded at a different level in the processing hierarchy, one that also includes the ventral site.

The implications of the processing sequence for surface interpolation and stereomotion may be captured in a flow diagram (Figure 9.11). Local disparity is known to be processed in V1. It was not probed by the experiments of Figure 9.10, because local disparity activation was equated in all three comparisons. The disparity signals must then reach the dorsal region for depth interpolation of the surface structure in the static disparity image.

FIGURE 9.11 Flow diagram of the processing implied by the activations in Figure 9.10. Boxes show processing stages with labels indicating their cortical sites. Arrows represent connecting pathways, dashed when speculative. Vertical dashed lines indicate logical sequence separators based on prior studies.

The ventral region is activated only by temporal changes in the surface structure and is therefore likely to be a temporal comparator mechanism operating for both types of depth motion. Such a mechanism would need input from a surface interpolation mechanism, but we do not have direct evidence of a pathway connecting the dorsal and ventral regions.

The lack of dorsal activation in Figure 9.10c implies not that there was no activation by surface structure but that there was no change in the surface structure to generate activation as the motion direction changed from z-axis (test) to y-axis (control); in contrast, the conditions for Figure 9.10a,b involved changes in depth structure, as the control was a flat stereosurface. The final element of the flow diagram of Figure 9.11 is a split into separate representations for the y-axis (hMT+ region) and z-axis (CSM region) directions of cyclopean motion. The separation is mandated by the fMRI activation, but the connections remain speculative (as did those of Maunsell and Van Essen, 1983, in their early neurophysiological studies). Further manipulations will be required to resolve all the details of the flow diagram.

Conclusion

The concept of surface representation requires a surface interpolation mechanism to represent the surface in regions of the visual field where the information is undefined. Whereas a typical receptive-field summation mechanism shows a stronger response as the amount of stimulus information increases, the characteristic of an interpolation mechanism is to increase its response as stimulus information is reduced and more extended interpolation is required. The cortical locus of the interpolation mechanism may therefore be sought by identifying fMRI activation sites that paradoxically increase their response to a depth structure stimulus (or do not decrease their response significantly) as the density of luminance information is reduced. Based on the evidence that there is a single depth interpolation mechanism for all visual modalities, experiments should be conducted for disparity, motion, and texture density cues to the same depth structure, to test the prediction that the same cortical interpolation site will show increased response as dot density is decreased for all three depth structure cues.
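The contrasting predictions for summation and interpolation mechanisms can be illustrated with a toy model, under several assumptions. The model samples a Gaussian depth bump at every `stride`-th position of a grid. It reconstructs the remaining positions by linear interpolation, a deliberately simple stand-in for cortical surface interpolation. The hypothetical summation response is proportional to the number of samples, and the hypothetical interpolation response is proportional to the number of positions that must be filled in. As samples are removed, the first response falls and the second rises, while the interpolated surface stays close to the true bump.

```python
import bisect
import math

def bump(x, sigma=1.0):
    """Gaussian depth bump, the kind of depth structure used in the cue-matching study."""
    return math.exp(-x * x / (2 * sigma * sigma))

def reconstruct(xs, sample_idx):
    """Linearly interpolate depth at every grid position from the sampled positions only."""
    sx = [xs[i] for i in sample_idx]
    sz = [bump(x) for x in sx]
    out = []
    for x in xs:
        j = bisect.bisect_left(sx, x)
        if j < len(sx) and sx[j] == x:
            out.append(sz[j])  # directly sampled: nothing to interpolate
        else:
            x0, x1 = sx[j - 1], sx[j]
            t = (x - x0) / (x1 - x0)
            out.append(sz[j - 1] + t * (sz[j] - sz[j - 1]))
    return out

def toy_responses(stride, n_grid=161, step=0.05):
    """Return (summation, interpolation, max_error) when every `stride`-th position is sampled.

    The grid spans [-4, 4]; n_grid - 1 must be a multiple of stride so both ends are sampled.
    """
    xs = [-4.0 + i * step for i in range(n_grid)]
    idx = list(range(0, n_grid, stride))
    z_hat = reconstruct(xs, idx)
    max_error = max(abs(zh - bump(x)) for zh, x in zip(z_hat, xs))
    summation = len(idx)               # hypothetical: grows with the number of depth samples
    interpolation = n_grid - len(idx)  # hypothetical: grows with the number of filled-in positions
    return summation, interpolation, max_error

for stride in (2, 4, 8, 16):
    s, i, e = toy_responses(stride)
    print(f"stride {stride:2d}: summation={s:3d}  interpolation={i:3d}  max error={e:.3f}")
```

Under these assumptions, a cortical site with the interpolation profile would show the paradoxical rise in response with decreasing dot density that the text proposes as its signature.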

In conclusion, the primary outcome of this review is the concept of 3D surface interpolation, that is, that the predominant mode of spatial processing is through a self-organizing surface representation (or attentional shroud, see Ch. 0) within a full 3D spatial metric. It is not until such a surface representation is developed that the perceptual system seems to be able to localize the components of the scene. This view is radically opposed to the more conventional concept that the primary quality of visual stimuli is their location, with other properties attached to this location coordinate (Marr, 1982). By contrast, the concept of the attentional shroud is a flexible network for the internal representation of the external object structure. In this concept, the attentional shroud is, itself, the perceptual coordinate frame. It organizes (“shrink-wraps”) itself to optimize the spatial interpretation implied by the complex of binocular and monocular depth cues derived from the retinal images. It is not until this depth reconstruction process is complete that the coordinate locations can be assigned to the external scene. In this sense, localization is secondary to the full depth representation of the visual input. Spatial form, usually seen as a predominantly 2D property that can be rotated into the third dimension, becomes a primary 3D concept of which the 2D projection is a derivative feature.

The net result of this analysis is to offer a novel insight into the nature of the binding problem. The separate stimulus properties and local features are bound into a coherent object by the “glue” of the global 3D surface representation. This view is a radical counterpoint to the concept of breaking the scene down into its component elements by means of specialized receptive fields and recognition circuitry. The evidence reviewed in this overview emphasizes the inherent three-dimensionality of the surface “shrink-wrapping” process by the attentional shroud in the form of a prehensile matrix that can cohere the object components whose images are projected onto the sensorium. Such active binding processes are readily implementable computationally with plausible neural components that could reside in a locus of 3D reconstruction in the human cortex.
