Learning MRF Parameters
The MRF model in the “Markov Random Fields for Stereo” section above has a number of free parameters (such as β, τ, and β) that must be set correctly for the model to be realistic and accurate enough to make good inferences. There are well-established procedures (Scharstein and Pal, 2005) for learning MRF parameters from “labeled” data samples, in this case left/right image pairs and true (“ground truth”) disparities. However, until recently, few datasets included ground truth disparity fields, which made it difficult to learn the MRF parameters. Fortunately, this obstacle is being removed now that an increasing number of datasets include ground truth, which is determined using tools such as laser range finders (used to measure the precise depth and, hence, the disparity of nearly every pixel in a scene).
Learning MRF parameters from data not only provides a principled way of choosing the model parameters (one that reflects the statistics of depth and intensity in natural scenes) but has also led to improved model performance, measured by comparing the disparity field estimated by the model with the ground truth disparities (Zhang and Seitz, 2005).
More Realistic Priors
An important limitation of the MRF prior in Equation (3.1) is that it penalizes disparity differences in neighboring pixels, which implies a bias in favor of fronto-parallel surfaces. Such a bias is inappropriate for many real-world scenes with slanted surfaces. Even the toy example considered in the section on how MRFs propagate information is likely to fail if the surface is slanted: the prior may have trouble propagating the linearly changing disparity beyond the textured region of the image. In such cases, although the first x and y derivatives of disparity may be nonzero, the second derivatives are zero. (Any planar surface has an associated disparity field D_r = ax + by + c, where r = (x, y); that is, the disparity is linear in the x and y image coordinates.)
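The point about derivatives can be checked numerically. The following is a minimal sketch, not the model of Equation (3.1) itself: it assumes an untruncated absolute-difference penalty on 4-connected neighbors, and the function names and plane coefficients are hypothetical, chosen only for illustration.

```python
import numpy as np

def first_order_penalty(d):
    """Sum of absolute disparity differences between 4-connected neighbors.

    Stand-in for a pairwise smoothness prior (an assumption, not Equation 3.1).
    """
    dx = np.abs(np.diff(d, axis=1)).sum()
    dy = np.abs(np.diff(d, axis=0)).sum()
    return float(dx + dy)

def second_difference_magnitude(d):
    """Sum of absolute second differences of disparity along x and along y."""
    dxx = np.abs(np.diff(d, n=2, axis=1)).sum()
    dyy = np.abs(np.diff(d, n=2, axis=0)).sum()
    return float(dxx + dyy)

# Pixel coordinate grids for a small 8 x 8 image.
y, x = np.mgrid[0:8, 0:8]

# A fronto-parallel surface has constant disparity.
fronto = np.full((8, 8), 5.0)

# A slanted plane has disparity D = a*x + b*y + c (illustrative coefficients).
slanted = 0.5 * x + 0.25 * y + 2.0

print(first_order_penalty(fronto))           # no penalty for constant disparity
print(first_order_penalty(slanted))          # penalized although perfectly planar
print(second_difference_magnitude(slanted))  # second differences vanish on a plane
```

Under these assumptions the pairwise penalty charges the slanted plane for every neighboring pixel pair even though the surface is perfectly smooth, while its second differences are exactly zero.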
Ongoing research in my laboratory seeks to overcome this fronto-parallel bias in the context of a specific application: terrain analysis for visually impaired wheelchair users. In this application (Ivanchenko et al., 2008), a stereo camera is pointed at the ground, such that the optical axis makes an angle of approximately 45° with the ground surface. The goal is to detect terrain irregularities such as obstacles, holes in the ground, and curbs, and to convey this information to the wheelchair user.
We designed a real-time algorithm for detecting and reporting terrain irregularities using a fast, commercially available stereo algorithm that is integrated with the stereo camera hardware. The stereo algorithm is based on simple window correlation rather than an MRF model and is therefore very fast, processing many frames per second. The disadvantage of using such a fast algorithm is that it produces sparse, noisy disparity estimates and smooths over depth discontinuities. However, the quality of the disparity estimates suffices for detecting large terrain irregularities such as trees and other obstacles. When the algorithm fails to detect any significant deviations from the dominant ground plane (e.g., sidewalk surface) in the scene, it seems sensible to apply a more sophisticated stereo algorithm such as an MRF model to examine the scene in more detail. Such a second algorithm may reveal the presence of a curb or other subtle depth discontinuity that was missed by the first algorithm.
The slant of the ground plane means that the disparity of the ground changes appreciably from one image row to the next, violating the fronto-parallel assumption. To rectify this problem, we are experimenting with warping one of the images so as to remove the disparity corresponding to the ground plane. (This idea was originally proposed by Burt, Wixson, and Salgian.) After warping, only scene points that lie off the ground plane will have nonzero disparity, and planes parallel to the ground plane (e.g., the road bordering the sidewalk) will have uniform disparities. In this way, the image data are transformed so that the fronto-parallel bias is appropriate.
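A rough sketch of such a warp is given below. It is not the implementation used in the application described here. It assumes rectified grayscale images, a known ground-plane disparity of the form d(x, y) = ax + by + c, and the convention that left pixel (x, y) matches right pixel (x − d, y). The function names and the synthetic test data are hypothetical.

```python
import numpy as np

def ground_plane_disparity(shape, a, b, c):
    """Disparity of the ground plane at each pixel: d(x, y) = a*x + b*y + c."""
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    return a * cols + b * rows + c

def warp_right_image(right, ground_disp):
    """Resample the right image so that ground-plane points get zero disparity.

    Assumed convention: left pixel (x, y) matches right pixel (x - d, y).
    """
    h, w = right.shape
    xs = np.arange(w, dtype=float)
    warped = np.empty((h, w), dtype=float)
    for row in range(h):
        # Linear interpolation along the row; np.interp clamps samples that
        # fall outside the image to the border value.
        warped[row] = np.interp(xs - ground_disp[row], xs, right[row])
    return warped

# Synthetic check: a textured "ground" whose disparity grows with the row index.
rng = np.random.default_rng(0)
left = rng.random((6, 40))
disp = ground_plane_disparity(left.shape, a=0.0, b=2.0, c=3.0)

# Build the right image by shifting each left row by its ground disparity.
right = np.stack([np.roll(left[r], -int(disp[r, 0])) for r in range(left.shape[0])])

warped = warp_right_image(right, disp)

# Ignore the leftmost columns, where the warp samples outside the image.
max_d = int(disp.max())
residual = np.abs(warped[:, max_d:] - left[:, max_d:]).max()
print(residual)  # close to zero: ground pixels now line up with the left image
```

After the warp, a stereo algorithm run on the left image and the warped right image would see approximately zero disparity on the ground, so any remaining disparity signals a point off the ground plane.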
Such a transformation may prove valuable for our application, but a more general solution is to impose a prior that assumes that locally planar surfaces with arbitrary slant and tilt are common. One way to enforce such a prior is to penalize deviations of the second derivatives of the disparity field from zero. At a minimum, such a prior must evaluate the relationship among three consecutive pixel disparities, because a second derivative requires three consecutive samples to be estimated. (A second derivative of zero implies that the three points are collinear in 3D.) This measure is beyond the capability of the pairwise MRF presented in this topic, and a straightforward implementation using a more powerful MRF with ternary (triplet) interactions would be extremely computationally demanding. Recent work (Woodford et al., 2008) replaces BP for such an implementation with another energy minimization algorithm that is much more efficient for this problem. The result is a tractable stereo algorithm with superior performance, particularly in its ability to propagate surface information on non-fronto-parallel surfaces.
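To make the difference between the two kinds of prior concrete, the sketch below scores two candidate disparity profiles along a single scanline. It is only an illustration under simple assumptions (untruncated absolute penalties, one dimension) and is not the method of Woodford et al. The function names and the scanline values are hypothetical.

```python
import numpy as np

def pairwise_energy(d):
    """First-order prior: penalizes differences between neighboring disparities."""
    return float(np.abs(np.diff(d)).sum())

def triple_energy(d):
    """Second-order prior: penalizes the discrete second derivative
    d[i-1] - 2*d[i] + d[i+1], which needs three consecutive samples."""
    return float(np.abs(d[:-2] - 2.0 * d[1:-1] + d[2:]).sum())

# True disparity along a scanline of a slanted surface: a linear ramp.
ramp = np.arange(10, dtype=float)

# Suppose only the first five pixels are textured. Two candidate ways to
# fill in the untextured remainder:
flat_fill = ramp.copy()
flat_fill[5:] = ramp[4]      # hold the last observed disparity constant
linear_fill = ramp.copy()    # continue the slant

for name, d in [("flat", flat_fill), ("linear", linear_fill)]:
    print(name, pairwise_energy(d), triple_energy(d))
```

In this toy setting the pairwise energy is lower for the flat fill, whereas the triple energy is lower for the linear fill, which is the behavior wanted on slanted surfaces. The cost is that each term couples three variables rather than two.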
From time to time, I am asked by neuroscientists and psychophysicists if MRF models such as the ones described in this topic have anything to do with biological vision systems. Although I confess to knowing little about biological vision, I would like to point to work by others arguing that the MRF-BP framework (perhaps extended to incorporate multiple depth cues) may be biologically plausible.
From a biological perspective, perhaps the most important property of models cast in this framework is that they are fully parallelizable: one can implement BP in a parallel hardware system with one computing node for each variable in the MRF, with directed connections between neighboring variables to represent BP messages. In each iteration of BP, messages flow along these connections from each variable node to neighboring nodes. Lee and Mumford (2003) have argued that BP may be a model for how information is passed top-down and bottom-up in the brain. Recent research (Ott and Stoop, 2006) has established that BP for MRFs with binary-valued variables (i.e., each variable can assume only two possible states) can be formulated with continuous time updates (rather than discrete time updates), resulting in behavior that closely matches the dynamics of a Hopfield network. Other work (Doya et al., 2006) relaxes the assumption of binary-valued variables and relates BP to a spiking network model.
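The parallel character of BP can be sketched in a few lines. The example below is an illustration under several assumptions that are not taken from the text: it uses the min-sum (cost-domain) form of BP, a one-dimensional chain instead of an image grid, and a truncated linear smoothness cost whose parameters `weight` and `trunc` are hypothetical. In each iteration, every message is recomputed simultaneously from the previous iteration's messages, as if each variable had its own computing node.

```python
import numpy as np

def parallel_min_sum_bp(unary, pair_cost, n_iters):
    """Synchronous min-sum BP on a chain.

    unary:     (n, K) array of data costs per node and label.
    pair_cost: (K, K) array of smoothness costs between neighboring labels.
    Returns the minimum-cost label at each node and the final beliefs.
    """
    n, K = unary.shape
    fwd = np.zeros((n, K))  # fwd[i]: message arriving at node i from node i-1
    bwd = np.zeros((n, K))  # bwd[i]: message arriving at node i from node i+1
    for _ in range(n_iters):
        # All updates read only the old messages, so they could run in parallel.
        send_right = unary + fwd  # what each node knows, excluding its right neighbor
        send_left = unary + bwd   # what each node knows, excluding its left neighbor
        new_fwd = np.zeros_like(fwd)
        new_bwd = np.zeros_like(bwd)
        new_fwd[1:] = (send_right[:-1, :, None] + pair_cost[None]).min(axis=1)
        new_bwd[:-1] = (send_left[1:, :, None] + pair_cost[None]).min(axis=1)
        fwd, bwd = new_fwd, new_bwd
    beliefs = unary + fwd + bwd
    return beliefs.argmin(axis=1), beliefs

# Seven pixels, four disparity labels. Only the two end pixels carry a cue
# (both favor label 2); the five pixels in between have no data at all.
n, K = 7, 4
labels_axis = np.arange(K)
unary = np.zeros((n, K))
unary[0] = np.abs(labels_axis - 2)
unary[-1] = np.abs(labels_axis - 2)

weight, trunc = 1.0, 2.0  # hypothetical smoothness parameters
pair_cost = weight * np.minimum(np.abs(labels_axis[:, None] - labels_axis[None, :]), trunc)

labels, beliefs = parallel_min_sum_bp(unary, pair_cost, n_iters=n)
print(labels)
```

On a chain, running at least n − 1 synchronous iterations lets every cue reach every node, so the uninformative interior pixels inherit the disparity suggested at the ends. On a grid with loops, the same update is used, but convergence is no longer guaranteed.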
In this topic, I presented an MRF framework for propagating surface information in 3D reconstruction in the presence of noisy and sparse depth cues. In addition to automatically weighing prior and likelihood information according to their reliability, the framework is the basis for many of the top-performing stereo algorithms in computer vision (see Scharstein and Szeliski, 2002, and the associated Web site, vision.middlebury.edu/stereo, which maintains up-to-date performance rankings of state-of-the-art stereo algorithms). While the standard prior used in MRF stereo algorithms imposes an unnatural fronto-parallel bias, promising recent work demonstrates the value of using a more realistic prior that accommodates the frequent occurrence of locally planar surfaces with arbitrary slant and tilt.
Although 3D reconstruction algorithms have improved considerably in recent years, much work remains. Despite the recent emphasis on learning model parameters from training data, the images used for training and testing often contain more highly textured, colorful objects than commonly occur in real-world scenes, which casts doubt on the ability of even the top-performing algorithms to generalize to the real-world domain. Additional performance measures may need to be developed to reward algorithms that minimize the kinds of catastrophic inference errors that are all too common at present, in which the disparities of some points are estimated incorrectly by tens of pixels.
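As one illustration of what such a measure might look like, the sketch below reports a gross-error rate alongside the mean absolute disparity error. This is a hypothetical example rather than an established benchmark definition. The function name, the 10-pixel threshold, and the optional validity mask are all assumptions.

```python
import numpy as np

def disparity_errors(estimated, ground_truth, gross_threshold=10.0, valid=None):
    """Compare an estimated disparity field with ground truth.

    Returns the mean absolute error and the fraction of pixels whose error
    exceeds gross_threshold (a hypothetical cutoff for "catastrophic" errors).
    Pixels where valid is False (e.g., no ground truth) are ignored.
    """
    err = np.abs(np.asarray(estimated, dtype=float) - np.asarray(ground_truth, dtype=float))
    if valid is None:
        valid = np.ones(err.shape, dtype=bool)
    err = err[valid]
    return {
        "mean_abs_error": float(err.mean()),
        "gross_error_rate": float((err > gross_threshold).mean()),
    }

# Toy example: every pixel is off by half a pixel, except one that is off by 30.
truth = np.zeros((4, 5))
estimate = truth + 0.5
estimate[2, 3] = 30.0

scores = disparity_errors(estimate, truth)
print(scores)
```

In this toy case a single badly wrong pixel raises the mean error only modestly, whereas the gross-error rate isolates it, which is why the two numbers are reported separately.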
More realistic priors will also be needed for algorithms to improve further. The price of increased realism may be the use of higher-level priors formulated to represent coherent surfaces, such as planar and cylindrical patches with explicit boundaries, rather than pixel-based depth or disparity fields.
Another avenue for improvement will be to integrate multiple depth cues, including monocular cues such as shading and texture, in addition to standard disparity cues. (Indeed, impressive work by Saxena, Sun, and Ng, 2007, estimates a depth field from a single color image using such cues.) It will also be important to integrate information over time (i.e., multiple video frames).
Finally, it is worth pointing out that improvements in optimization techniques such as BP will be required to realize many of the proposed extensions above, and may well influence the direction of future research.