In this topic, we consider what information is available to the human visual system in cases of flat and nonflat surfaces. In the case of flat surfaces, visual cues can be interpreted without any reference to surface shape; under many conditions, cue integration is well described by a linear rule; moreover, it is possible to propagate cues along the surface to the locations where visual information is poor or missing. In the general case of nonflat surfaces, cue interpretation depends on surface shape. In such situations, the visual system can interpret surface cues in at least two different ways: shape can be ignored (i.e., a planar approximation is used instead) or surface shape can be estimated beforehand to explain away the effect of shape on interpretation of visual cues.
Numerous motor and visual tasks of everyday life require our visual system to operate with dense surface representations. Consider, for example, a simple task of placing a cup on a table. Even if a table surface has no texture and thus no visual cues for surface position and orientation, humans apparently have no difficulty in accomplishing the task. In this case, one can hypothesize that visual cues from a table border propagate across a table surface to help in the placing task. Note that situations when the visual system has to rely on sparse visual cues to build a dense surface representation are not rare. Apart from the lack of texture, a visual scene may have very few highly reliable and informative cues (e.g., originating from strong edges of regular shapes) that can be effectively used in the locations where cues are less informative. Such cue propagation should rely on prior knowledge about surface shape in order to interpolate information properly.
The most realistic scenario for cue propagation is when a visual surface is flat or slowly curving and has a known orientation. For nonflat surfaces, cue interpretation depends on surface shape. Although most of the previous research focused either on flat surfaces or on surfaces with only one principal direction of curvature (e.g., cylinders), this topic considers a more general case of nonflat surfaces. Specifically, it explores a case of cue integration in which visual cues depend on both the slant and the shape of visual surfaces.
In the first section, I briefly review a cue integration framework that is used to estimate properties of visual scenes and visual surfaces in particular. Cue integration considers evidence from a single location, but cue propagation uses (interpolates) evidence from multiple locations to compensate for the absence or weakness of visual cues. A method for detection of weak cues and facilitation of information propagation to corresponding locations is discussed in the second section. The third section focuses on nonflat surfaces and considers a hypothesis that shape estimation precedes cue interpretation. We report the findings of two psychophysical experiments that suggest that shape cues participate in the interpretation of slant cues.
The topic of human perception, and perception of visual surfaces in particular, has been a subject of extensive research for decades. Numerous visual and motor tasks were analyzed in order to determine how visual information about surface shape, orientation, position, motion, and so forth, is processed by the human brain (Cumming, Johnston, and Parker, 1993; Landy et al., 1995; Jacobs, 2002; Knill and Saunders, 2003). Researchers widely used the concept of an ideal observer to define the information content needed to achieve a certain level of performance (Blake, Bülthoff, and Sheinberg, 1993; Buckley, Frisby, and Blake, 1996; Knill, 1998; Geisler, 2003). It was hypothesized that visual information is extracted and initially processed in small quasi-independent pieces called cues. The alternative to this modular approach is to analyze a visual scene in all its complexity and interactions (i.e., strong fusion), which does not seem plausible due to the harsh requirements on processing resources and learning time (Clark and Yuille, 1990; Johnston, Cumming, and Parker, 1993; Landy et al., 1995; Nakayama and Shimojo, 1996).
In short, visual cues are simple features that are relatively easy to extract and that correlate with important characteristics of visual scenes. For example, shape and size gradients of texture elements can be considered as cues for surface orientation. The assumption that cues indicate a certain scene property independently significantly simplifies cue learning and processing. Consequently, cue learning can be accomplished in isolation based, for example, on the correlation of a given cue with other cues from the same or different modality (Ernst and Banks, 2002; Ivanchenko and Jacobs, 2004). Cue processing includes a simple mechanism for combining the evidence from independent cues to improve overall accuracy of estimation.
Classical Cue Integration Framework
A classical cue integration framework describes how several cues are combined at a single location. Because cues are assumed to be independently generated by a visual scene (or, in other words, the noise present in those cues is independent), the optimal rule for cue integration is essentially linear. Numerous experiments demonstrated that human performance corresponds well to the results of optimal cue integration (Young, Landy, and Maloney, 1993; Knill and Saunders, 2003; Hillis et al., 2004). Similar results were found when cues were integrated across different modalities such as visual, audio, and tactile (Ernst, Banks, and Bülthoff, 2000; Ernst and Banks, 2002; Battaglia, Jacobs, and Aslin, 2003).
Mathematically, one can express cue independence as a product of conditional probabilities. In addition, one can rewrite the probability of a scene property S conditioned on cues I1 and I2 according to Bayes' rule. This allows inferences to be made about a scene property (a cause) from observed cues (the effect):

P(S | I1, I2) = P(I1 | S) P(I2 | S) P(S) / P(I1, I2)    (5.1)
Here the conditional probabilities on the right-hand side describe how cues I1, I2 are generated by a scene property S, and P(S) specifies prior information about S. The denominator can be ignored because it is constant for a given scene. In the case of independent Gaussian noise, Equation (5.1) leads to a simple linear rule for cue integration (Cochran, 1937):

S = w1 S1 + w2 S2    (5.2)

Here, S1, S2 are the estimates of a scene property from individual cues; S is the combined estimate; and w1, w2 are cue weights. It can be shown that the weights are inversely proportional to cue uncertainty (i.e., the variance of the noise) and that the uncertainty of the combined estimate is smaller than the uncertainty of any of its constituents. Cue uncertainty, and consequently a cue weight, changes across visual conditions, and one of the important functions of the visual system is to adjust cue weights accordingly. When the weights are known, Equation (5.2) provides a simple way to integrate estimates from individual cues to improve the overall accuracy of estimation.
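As an illustration of Equation (5.2), the inverse-variance weighting can be sketched in a few lines of Python; the slant estimates and noise variances below are hypothetical numbers chosen only for the example.

```python
import numpy as np

def integrate_cues(estimates, variances):
    """Combine independent cue estimates with inverse-variance weights
    (the linear rule of Equation 5.2): w_i = (1/var_i) / sum_j (1/var_j)."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    weights = (1.0 / variances) / np.sum(1.0 / variances)
    combined = np.sum(weights * estimates)
    # Variance of the combined estimate: smaller than any single cue's variance.
    combined_var = 1.0 / np.sum(1.0 / variances)
    return combined, combined_var

# Hypothetical slant estimates (degrees) from a texture cue and a disparity cue.
s, v = integrate_cues([30.0, 36.0], [4.0, 2.0])
```

Note that the combined variance (4/3 here) is smaller than either cue's variance, as the framework predicts, and the combined estimate lies closer to the more reliable cue.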
Because the independence of visual cues is an approximation, performance can be further improved if the perceptual system compensates for deviations from a linear rule. One such nonlinearity, called a cue conflict, arises when cue estimates are too discrepant to be meaningfully combined. This discrepancy was initially explained by the presence of processing errors and outliers (Landy et al., 1995). To avoid erroneous estimates, the weaker cue is downweighted or even dropped from the cue combination. This makes cue integration robust to the presence of outliers.
Such heuristic treatment of a cue conflict was recently rationalized in a probabilistic Bayesian framework by associating a large cue conflict with incorrect prior assumptions (Knill, 2003, 2007). Prior assumptions (not to be confused with a prior probability of a scene property) are what make cues informative about a visual scene. For example, if one assumes that the world consists of circular objects, the aspect ratio of a projected circular outline would be a good cue for object orientation. However, if objects in the world have ellipsoidal outlines (a less constrained prior), the strength of the aspect ratio cue would be greatly reduced. A key idea in this example is the assumption that the world consists of a mixture of circular and ellipsoidal objects.
Under different viewing conditions, each class of objects may have different prior probabilities of occurrence. This can be modeled with a mixture of priors that specifies probabilities for each object class. Then, preferential attenuation of a cue during a conflict would be based on the evidence for each prior model. Mathematically, the mixture of priors for a cue likelihood describes how the cue is generated by different classes of objects:

P(I) = π1 P(I | M1) + π2 P(I | M2)    (5.3)
Here, the probabilities associated with each prior assumption M1, M2 are weighted by π1, π2 to form a full likelihood model for a cue I. The weights reflect the degree to which each prior assumption applies to the current environment. A large conflict provides evidence that an object was drawn from an ensemble with a less-constrained prior. According to Equation (5.3), the cue attenuation effect happens when a less-constrained prior dominates the mixture. Because a less-constrained prior entails a less informative cue, the cue weight goes down during a cue conflict.
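A minimal numerical sketch of Equation (5.3): two Gaussian likelihood components stand in for a constrained prior model M1 (narrow likelihood) and a less-constrained model M2 (broad likelihood). All widths, mixture weights, and observed values are hypothetical.

```python
import math

def mixture_likelihood(x, pi1, pi2, sigma1, sigma2, mu=0.0):
    """Full likelihood P(I) = pi1*P(I|M1) + pi2*P(I|M2) (Equation 5.3).
    M1: constrained prior (narrow component), M2: less constrained (broad)."""
    def gauss(x, mu, s):
        return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))
    l1 = gauss(x, mu, sigma1)
    l2 = gauss(x, mu, sigma2)
    total = pi1 * l1 + pi2 * l2
    # Posterior responsibility of the less-constrained model M2.
    r2 = pi2 * l2 / total
    return total, r2

# Small discrepancy: the constrained model explains the data, M2 stays weak.
_, r_small = mixture_likelihood(0.5, 0.5, 0.5, 1.0, 5.0)
# Large discrepancy (cue conflict): evidence shifts to the less-constrained model.
_, r_large = mixture_likelihood(4.0, 0.5, 0.5, 1.0, 5.0)
```

As the discrepancy grows, the posterior weight shifts to the less-constrained model, and since that model makes the cue less informative, the cue's effective weight in integration drops, which is the attenuation effect described above.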
Example of Robust Cue Integration
In order to demonstrate some basic concepts of cue integration, I designed a simple visual example, shown in Figure 5.1. Here the shape of visual surfaces is represented by four cues: texture, shading, occlusion, and contour. Note that not all the cues are available at the same time. In particular, at a small slant (top row) only texture and shading cues are present, while at a larger slant (bottom row) all four cues are present. The cues in the left column are consistent with the surface shape; to demonstrate a cue conflict, the shading cue in the right column was made inconsistent with surface shape.
FIGURE 5.1 Robust cue integration for surface shape. In the left column, cues are consistent. In the right column, a shading cue is inconsistent with surface shape. In the top row, the surfaces are slanted at 30° from vertical, and in the bottom row the surfaces are slanted at 70°. A cue conflict is evident only at a large slant.
At a small slant, a shading cue for surface shape is the strongest one. According to a cue integration framework, texture and shading cues are combined, and the shading cue dominates the combination. When cues are inconsistent, a texture cue that has a smaller weight is dropped out of cue combination (upper right image). The perception of shape in this case is based solely on shading and is completely realistic in spite of the conflict. This demonstrates the robustness of cue integration in the presence of outliers (even though the outlier is a correct texture cue).
To demonstrate how cue strength varies across visual conditions, consider what happens when the surface slant is large (bottom row). At a large slant, two additional cues for shape become available: a contour cue and an occlusion cue. Now the contour cue is the strongest one, and the remaining cues contribute their estimates to a lesser degree. In the case when cues are consistent (bottom left), the perception of shape remains intact. In the case when cues are inconsistent (bottom right), the relatively weaker shading cue is dropped out of the cue combination. This happens because the shading cue becomes an outlier due to its inconsistency with the stronger contour cue. Consequently, at a large slant, shading is no longer associated with surface shape and is perceived as a color of the surface (bottom right). Thus, the strength of visual cues changes with visual conditions and defines how visual cues are combined or ignored.
Another potential nonlinearity of cue integration happens when a cue probability becomes multimodal or when several cues depend on a common parameter (Landy et al., 1995; van Ee, Adams, and Mamassian, 2003; Adams and Mamassian, 2004). In the case when a cue likelihood is multimodal, cues can interact with each other for the purpose of disambiguation: one cue can help another select a particular peak in its likelihood function. When cues depend on a common parameter, interaction happens for the purpose of cue promotion. One goal of cue promotion is to eliminate mutual dependencies of cues on a common parameter; another goal is to express cues on a common scale so that they can be meaningfully combined. For example, both binocular disparity and kinetic-depth-effect cues scale with viewing distance (a common parameter) but in essentially different ways. Only at one viewing distance do they produce a consistent depth image, which can be used to solve for the viewing distance and to restore cue independence.
Although the particular mechanism of cue promotion remains largely unknown, it does not seem to represent a significant challenge on the algorithmic level. In fact, Landy et al. (1995) suggested several methods for solving for a common parameter of depth cues. However, if a cue for a certain scene property also depends on surface shape that changes as a function of spatial location, solving jointly for the cue and shape may be nontrivial (Ivanchenko, 2006). A more computationally plausible solution is to estimate surface shape independently from other scene properties. In a sense, this is similar to the independent estimation of cues. A further discussion of this issue is given in the next section.
To summarize, the cue integration framework makes several major assumptions about information extraction and processing, such as the use of simple visual features (cues) and cue independence. On the one hand, these assumptions are quite general and simplify computation and learning. On the other hand, they are only approximately true. The visual system seems to compensate for simple dependencies between cues using cue promotion and disambiguation. To further improve performance, the visual system adjusts the weights on cues and priors depending on viewing conditions. Overall, the cue integration framework provides a basis for simple, robust, and statistically optimal estimation of scene properties when cues are considered at a single spatial location.
Cue Propagation in the Case of Noisy Cues
Cue propagation extends the cue integration framework by allowing visual cues to be combined at different spatial locations. The benefits of such integration are straightforward only if a visual surface has similar properties in some neighborhood; hence, we consider only planar or slowly curving surfaces. In those cases, highly informative but sparse cues can substitute for less informative neighbors due to the redundancy of visual information.
This mechanism is not only tractable in terms of computation and learning times but also seems to be physiologically plausible. It also allows information to be propagated along a surface without knowing surface position and orientation in advance. Merely specifying a prior constraint such as smoothness of some surface parameter is enough to produce a consistent and dense surface representation. Such constraints specify how information is interpolated during propagation.
Here I look at this mechanism from the viewpoint of cue integration theory and draw some parallels between the computational formulation and cue probabilities. I focus on the case when visual information is propagated into a surface area with noisy and unreliable cues and show that this situation is analogous to a cue conflict. The main goals are to analyze the conditions for cue propagation and to compare the role of different prior constraints in surface reconstruction. As an example, I use three-dimensional (3D) surface reconstruction from binocular disparity, formulated as inference on a Markov random field (MRF). Finding binocular correspondences remains a hot topic in computer vision (Sun, Shum, and Zheng, 2002; Zhang and Seitz, 2005; Seitz et al., 2006). Some of the challenges are the scarcity of strong visual cues and the presence of multiple matching hypotheses. Though the latter characteristic differs from the assumption of the classical cue integration framework (where a unimodal Gaussian distribution represents a single hypothesis), the same principles apply. For example, as the strength of a cue declines due to changes in viewing conditions, the role of prior information correspondingly increases.
MRFs are widely used in computer vision to express the joint probability of a grid of random variables with local interaction (Weiss, 1997; Weiss and Freeman, 2001). Similar to cue integration, MRFs combine evidence from image cues and prior cue probabilities. Unlike cue integration, MRFs specify a constraint on neighboring variables to represent local interactions.
There are several methods for performing inference with MRFs that maximize either marginal or maximum a posteriori (MAP) probabilities. These methods were proven to perform optimally in the absence of loops in the MRF but were also shown to perform well in loopy cases (Cochran, 1937; Felzenszwalb and Huttenlocher, 2006). Here I focus on one such method, belief propagation (BP), because BP explicitly describes how neighboring variables exchange messages related to hypotheses about local surface properties.
For the purposes of the current discussion, it is sufficient to say that during BP, each variable in the MRF sends a message to its neighbors, and this process iterates throughout the grid until the messages converge. The content of these messages reflects the sender's best guess about the receiver's likelihood based on all locally available evidence, including a prior. If we associate a discrete likelihood with a set of hypotheses about a visual property (e.g., disparity in the image), then we can say that through message updates, BP dynamically reevaluates all possible hypotheses based on a compromise between local evidence and a prior constraint.
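The message-update scheme can be illustrated with a toy sum-product belief propagation on a one-dimensional chain of binary disparity hypotheses. This is only a sketch of the general idea, not the algorithm used in the reported work; the compatibility value and the likelihood table are made up for the example.

```python
import numpy as np

def bp_chain(likelihoods, smooth=0.3, iters=10):
    """Sum-product belief propagation on a 1-D chain MRF.
    likelihoods: (n_nodes, n_states) local evidence for each hypothesis.
    The pairwise prior (Potts-style) favors neighbors agreeing on a state."""
    n, k = likelihoods.shape
    # Compatibility matrix: 1.0 for equal states, `smooth` otherwise.
    psi = np.full((k, k), smooth) + (1.0 - smooth) * np.eye(k)
    msg_r = np.ones((n, k))  # messages arriving from the left neighbor
    msg_l = np.ones((n, k))  # messages arriving from the right neighbor
    for _ in range(iters):
        for i in range(n - 1):          # left-to-right sweep
            m = (likelihoods[i] * msg_r[i]) @ psi
            msg_r[i + 1] = m / m.sum()
        for i in range(n - 1, 0, -1):   # right-to-left sweep
            m = (likelihoods[i] * msg_l[i]) @ psi
            msg_l[i - 1] = m / m.sum()
    beliefs = likelihoods * msg_r * msg_l
    return beliefs / beliefs.sum(axis=1, keepdims=True)

# Strong cues at the ends, flat likelihoods (no evidence) in the middle:
L = np.array([[0.9, 0.1], [0.5, 0.5], [0.5, 0.5], [0.9, 0.1]])
beliefs = bp_chain(L)
```

With strong evidence for the first state at both ends and flat likelihoods in between, the middle nodes' beliefs end up favoring that state: the cue propagation effect described above, arising purely from the message updates and the smoothness prior.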
Note that BP has no explicit mechanism for propagating strong cues (i.e., those whose likelihoods have strong peaks) into locations that contain weak cues (i.e., those whose likelihoods have no strong peaks). Nevertheless, such propagation does happen in the MRF framework. Moreover, the propagation can be facilitated if the regions where cues propagate contain variables with flat likelihoods. Note that a textureless region has no cues for disparity; thus, the corresponding likelihoods are flat. The propagation happens because the MAP solution to which the algorithm converges is a product of likelihoods and smoothness priors, and the latter are higher for variables with similar likelihoods. Thus, the initially flat likelihoods inside the textureless region acquire values similar to the likelihoods at the region periphery. Importantly, in this case, the region with flat likelihoods has no evidence contrary to the evidence being propagated into the region.
Here we consider the case when cues are present throughout the image but are weak or noisy in some of its regions (as happens in areas with low-contrast texture). Consequently, cue likelihoods are not flat but rather have a few weakly pronounced peaks that correspond to multiple matching hypotheses. Some peaks arise as a result of random noise in one or both images of a stereo pair. Though these noisy likelihoods look similar to flat ones, propagation is limited. As computer simulations show, variables in regions with noisy likelihoods converge to MAP probabilities that show little or no influence from neighboring regions with strong likelihoods. Thus, propagation is limited when likelihoods are not completely flat.
Algorithmically, this can be explained by the fact that when two variable likelihoods inside of a noisy region coincidentally express similar hypotheses, these hypotheses are reinforced by a smoothness prior. Variables with such reinforced likelihoods become a major obstruction to propagation. This is because the hypotheses expressed in their likelihood often do not support the ones propagated into the region (and vice versa). More research is required in order to better understand factors for information propagation in MRFs. Here I suggest some solutions for improving propagation in the areas with noisy cues.
A straightforward solution is to discard the regions with noisy or weak cues from consideration. This is common practice in the disparity validation procedures of correlation-based stereo. The obvious disadvantage of this procedure is holes in the disparity map. Another solution is to use multiscale MRFs, where short-distance propagation at a coarse scale corresponds to long-distance propagation at a fine scale. However, a more principled approach is to detect weak likelihoods and flatten them out. A tentative definition of a weak discrete likelihood can be based on comparing the magnitude of its peaks with the average value expected for a single level: a likelihood is weak if its peaks are not much larger than the expected mean value. Note that detecting weak likelihoods is similar to finding weak cues during a cue conflict. Moreover, the flattening process is similar to ignoring a discrepant cue during robust cue integration. This is because a flat likelihood carries no hypothesis and thus cannot be considered a cue.
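The tentative peak-versus-mean test can be sketched as follows; the threshold ratio of 2 is an assumed parameter for illustration, not a value from the text.

```python
import numpy as np

def flatten_weak(likelihood, ratio=2.0):
    """Flatten a discrete likelihood whose largest peak is not much larger
    than the mean value expected for a single level (a weak, noisy cue)."""
    likelihood = np.asarray(likelihood, dtype=float)
    mean = likelihood.mean()  # average value expected per level
    if likelihood.max() < ratio * mean:
        # Weak cue: replace with a uniform likelihood that carries no hypothesis.
        return np.full_like(likelihood, 1.0 / likelihood.size)
    return likelihood / likelihood.sum()

strong = flatten_weak([0.01, 0.02, 0.9, 0.02])   # clear disparity peak: kept
weak   = flatten_weak([0.24, 0.31, 0.27, 0.28])  # only noisy peaks: flattened
```

Flattened likelihoods then behave like the textureless case above: they present no contrary hypotheses, so strong neighboring cues can propagate into the region.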
It is interesting to note that although the likelihood flattening method was derived solely from practical considerations, there is a theoretical justification for it. It is based on the probabilistic theory of a cue conflict discussed in the first section. Mathematically, this theory describes a mixture of prior assumptions that make cues informative. For the problem of finding binocular correspondences, one can model images as a mixture of at least two types of objects. One contains strong edges (e.g., outlines of objects and high-contrast texture), and the other includes image regions with low-contrast intensities (e.g., uniform color of object surfaces and low-contrast texture). For the first class of objects, we can use very informative edge cues; for the second class of objects, the cues are less informative. According to the mixture of priors approach, we can express a full cue likelihood as

P(I | D) = π_edge P(I | D, M_edge) + π_intensity P(I | D, M_intensity)
Here the first component of the likelihood is due to informative edge cues, and the second component comes from less informative intensity cues. The strength of the edge can be used to indicate the weight to which each prior model applies. Then the formula for the likelihood (Coughlan, 2011, this volume) that expresses a matching error m(D) at a certain image location and disparity D,

P(I | D) ∝ exp(−β m(D)),

can be rewritten as

P(I | D) = π_edge exp(−β1 m(D)) + π_intensity exp(−β2 m(D)),

where β2 < β1, reflecting the fact that the intensity cue is less informative than the edge cue. In the marginal case of β2 = 0, π_intensity = 1, and π_edge = 0, we obtain the above-described flattening method for likelihoods. Note that in order to detect cues with weak likelihoods, one can directly analyze likelihood peaks, which is more beneficial than measuring edge strength. The former has the advantage of detecting cases when the reason for a weak likelihood is not only low-contrast texture but also highly regular texture (i.e., texture that produces multiple matching hypotheses).
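A sketch of the rewritten mixture likelihood; the mixture weights and β values are hypothetical. Setting β2 = 0 with all weight on the intensity component reproduces the flattening method: the likelihood no longer depends on the matching error.

```python
import math

def match_likelihood(m, pi_edge, beta1, beta2):
    """Mixture likelihood for a matching error m(D): an informative edge
    component exp(-beta1*m) plus a weaker intensity component exp(-beta2*m),
    with beta2 < beta1."""
    pi_intensity = 1.0 - pi_edge
    return pi_edge * math.exp(-beta1 * m) + pi_intensity * math.exp(-beta2 * m)

# Marginal case: beta2 = 0 and pi_edge = 0 gives a likelihood that is flat
# in m, which is exactly the flattening of a weak cue.
flat = [match_likelihood(m, 0.0, 1.0, 0.0) for m in (0.0, 1.0, 5.0)]
```

For a nonzero edge weight, the likelihood again decreases with the matching error, so strong-edge locations retain a peaked, informative likelihood.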
As computer simulations show, flattening weak likelihoods greatly improves the propagation of disparity information into regions with weak cues. Such a method is especially applicable in stereo algorithms, because they often rely on a smoothness constraint that justifies cue propagation and provides a reasonable way to interpolate information along the surface. The only image regions where the smoothness constraint is usually not enforced correspond to the borders of objects, but those areas typically have high-contrast pixels with strong cues and thus are not affected by propagation. Thus, the cue conflict theory seems to be applicable to at least one computer vision problem, where it facilitates information propagation and the creation of dense stereo maps from sparse cues.
FIGURE 5.2 Surface reconstruction from binocular disparity. Left (a): region of one of the images from a stereo pair; the area for reconstruction is depicted with a white rectangle. Top right (b): surface reconstruction based on a disparity smoothness prior. Bottom right (c): surface reconstruction based on an elevation smoothness prior.
Figure 5.2 shows a 3D reconstruction of a surface region based on the above-described flattening method. Note that the reconstructed surfaces are dense and have none of the holes that are typical of correlation-based stereo. It was possible to run the BP algorithm in near real time (less than 200 ms on a GPU) at VGA image resolution with 32 disparity levels (Ivanchenko, Shen, and Coughlan, 2009).
We reconstructed a distant image region on purpose, because the strength of the disparity cue decreases with viewing distance. This makes distant regions good candidates for observing weak cues and for analyzing the role of the smoothness constraint, whose influence increases as cue strength decreases. Note that this constraint describes properties of visual surfaces in the world, as opposed to the mixture of priors, which describes properties of the images.
In the 3D reconstruction, we used two different smoothness constraints: one enforcing smooth disparity and one enforcing smooth elevation of visual surfaces. These constraints provide a way to automatically interpolate information along the surface and also bias the surface reconstruction to produce either fronto-parallel surface patches (disparity prior) or patches that are parallel to the ground plane (elevation prior). As can be seen in Figure 5.2, the reconstructed surfaces of a staircase have different shapes depending on the corresponding smoothness assumption. The two reconstructions look similar only at the locations where cues are strong (i.e., have strong edges).
Note that while a flattening method helps to obtain a dense surface representation, the form of representation in the areas with weak cues is affected by a smoothness constraint. Which reconstruction and corresponding smoothness constraint is better? There is no single answer to this question. The factors that influence a choice of a constraint include the structure of the environment, task requirements, and the compactness of a resulting representation (e.g., a number of disparity levels).
To summarize, this section considered the case when strong cues are propagated across flat or slowly curving surfaces into areas with weak or noisy cues. To guarantee that propagation is efficient, it was suggested that cue likelihoods that have only weak peaks should be flattened completely. This was justified under a probabilistic approach that considers a mixture of prior assumptions for each cue. For the purpose of finding a dense disparity map, stereo images were modeled as a mixture of two classes of objects: one that has strong edges and one with low-contrast intensities. Note that the first class of objects is typically sparse in natural images and has strong cues; the second class is less sparse but has weaker cues. To define the prior probability of each class at a certain image location, one can use edge strength or directly analyze the cue likelihood. While the mixture of priors describes image properties, there is another class of priors that describes surfaces. This section analyzed the effect of two such priors (smoothness of elevation and smoothness of disparity) on 3D reconstruction.