2D is nice, but these days I’m getting interested in doing computer vision in 3D. One way to get 3D data is to use two cameras and determine distance by looking at the differences in the two pictures (just like eyes!). In this project I show some initial results and codes for computing disparity from stereo images.
UPDATE: Check this recent post for a newer, faster version of this code. The new version no longer relies on mean-shift.
People can see depth because each eye views the same scene from a slightly different angle. Our brains then figure out how close things are by determining how far apart they appear in the two images from our eyes. The idea here is to do the same thing with a computer. Check this for some information on the geometry and mathematics of stereo vision. First, here are the images I’ll use to show results.
These two images are slightly different. The top one is from the left and the bottom is from the right. It’s a bit hard to see the disparity like this, so here are the same two images placed “on top” of one another.
You can see that the close-up objects like the lamp are very misaligned in the two images, while the farther-away things like the poster and the camera are lined up better. The greater the misalignment, the closer the object.
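This inverse relationship falls straight out of the geometry of a rectified stereo pair: depth is proportional to one over disparity. As a quick sanity check (with made-up camera numbers, since I’m not reproducing the Middlebury calibration here):

```matlab
% Depth from disparity for a rectified stereo pair (standard pinhole model).
% The focal length and baseline below are made-up example values.
f = 700;            % focal length in pixels (assumed)
B = 0.16;           % baseline between the cameras in meters (assumed)
d = 35;             % measured disparity in pixels
Z = f * B / d;      % depth in meters: bigger disparity => closer object
```

So the lamp, with its large disparity, comes out close, while the poster’s small disparity puts it far away.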
This pair of images is one of many standard stereo pairs that can be found at the Middlebury stereo vision site. These guys keep a compendium of standard datasets as well as a scoreboard of whose algorithms work the best. The algorithm I talk about here is a knock-off of the one that was on top in December 2007: “Segment-Based Stereo Matching Using Belief Propagation and a Self-Adapting Dissimilarity Measure” [PDF] by Klaus, Sormann, and Karner. (Mind that the algorithm here is *inspired* by the algorithm of Klaus et al.; theirs is much more complete.)
Getting Pixel Disparity
The first step here is to get an estimate of the disparity at each pixel in the image. A reference image is chosen (in this case, the right image), and the other image slides across it. As the two images ‘slide’ over one another we subtract their intensity values. Additionally, we subtract gradient information from the images (spatial derivatives). Combining these gives better accuracy, especially on surfaces with texture. In the video below, we can see a visualization of this process. You’ll notice how far-away objects go dark (meaning they line up in the two images) at different times than close-up objects. For each pixel we record the offset at which the difference is smallest, along with the value of that difference.
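Here’s a minimal MATLAB sketch of this slide-and-subtract search, with the right image as reference. The filenames, window size, search range, and the intensity/gradient weight are all assumptions for illustration, not the values from the posted code:

```matlab
% Slide-and-subtract disparity search (right image as reference).
% Cost = blend of absolute intensity difference and absolute gradient
% difference, aggregated over a small window. All parameters are guesses.
L = double(imread('left.png')) / 255;    % hypothetical filenames
R = double(imread('right.png')) / 255;
if size(L,3) == 3, L = mean(L,3); R = mean(R,3); end

maxDisp = 60;                 % assumed disparity search range (pixels)
alpha   = 0.5;                % weight between intensity and gradient costs
[Gl, ~] = gradient(L);        % horizontal spatial derivatives
[Gr, ~] = gradient(R);

[h, w]   = size(R);
bestCost = inf(h, w);
dispRL   = zeros(h, w);
for d = 0:maxDisp
    % With the right image as reference, a pixel's match sits d pixels to
    % the right in the left image, so slide the left image left by d.
    sL  = [L(:,  1+d:w), zeros(h, d)];
    sGl = [Gl(:, 1+d:w), zeros(h, d)];
    cost = (1 - alpha) * abs(R - sL) + alpha * abs(Gr - sGl);
    cost = conv2(cost, ones(5)/25, 'same');   % aggregate over a 5x5 window
    better = cost < bestCost;                 % keep the lowest cost so far
    bestCost(better) = cost(better);
    dispRL(better)   = d;                     % ...and the offset that gave it
end
% (The rightmost d columns are zero-padded and shouldn't be trusted.)
```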
We perform this slide-and-subtract operation with the right image as reference (R-L) and with the left image as reference (L-R). Then we try to eliminate bad pixels in two ways. First, at each pixel we keep the disparity from whichever pass had the lower matching difference. Next, we mark as bad all points where the R-L disparity is significantly different from the L-R disparity. Finally, we are left with a pixel disparity map.
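The consistency check might look something like this sketch, assuming `dispRL` from the loop above and a `dispLR` computed the mirrored way; the one-pixel tolerance is an assumption:

```matlab
% Left-right consistency check: a pixel's R-L disparity should agree with
% the L-R disparity at the location it maps to in the left image.
tol   = 1;                         % allowed disagreement in pixels (assumed)
valid = true(size(dispRL));
[h, w] = size(dispRL);
for y = 1:h
    for x = 1:w
        d  = dispRL(y, x);
        xl = x + d;                % where this pixel lands in the left image
        if xl > w || abs(dispLR(y, xl) - d) > tol
            valid(y, x) = false;   % occluded or mismatched: mark as bad
        end
    end
end
```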
In this image, redder colors indicate closer pixels, and bluer colors represent pixels that are farther away.
Filtering the Pixel Disparity
In the next step, we combine image information with the pixel disparities to get a cleaner disparity map. First, we segment the reference image using a technique called “Mean Shift Segmentation.” This is a clustering algorithm that “over-segments” the image. The result is a very ‘blocky’ version of the original image.
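I won’t reproduce a full mean-shift implementation here (the posted code uses a proper one), but the flavor of the idea is easy to sketch: repeatedly replace each pixel with the average of nearby pixels that are also similar in value, so the image flattens into blocky regions. The bandwidths below are made up:

```matlab
% Much-simplified mean-shift-style filtering on a grayscale image.
% This is only the filtering step; grouping the flattened pixels into
% labeled segments (e.g. via connected components) would follow.
I = double(imread('right.png')) / 255;   % hypothetical filename
if size(I,3) == 3, I = mean(I,3); end
hs = 4;      % spatial bandwidth: half-width of the neighborhood (assumed)
hr = 0.1;    % range bandwidth: how similar counts as 'similar' (assumed)
F  = I;
[h, w] = size(I);
for iter = 1:5                            % a few shift iterations
    G = F;
    for y = 1:h
        for x = 1:w
            ys = max(1, y-hs):min(h, y+hs);
            xs = max(1, x-hs):min(w, x+hs);
            patch = G(ys, xs);
            near  = abs(patch - G(y,x)) <= hr;   % flat range kernel
            F(y,x) = mean(patch(near));          % shift toward the local mode
        end
    end
end
```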
Then, for each segment, we look at the associated pixel disparities. In my simple implementation, we assign each segment to have the median disparity of all the pixels within that segment. This gives the final result:
Here again, red indicates close objects and blue indicates objects that are far away.
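For reference, the segment-median step is simple enough to sketch in a few lines. Here `segs` is a hypothetical label map from the segmentation (one integer per segment), and `dispRL`/`valid` are from the sketches above:

```matlab
% Assign each segment the median disparity of its valid pixels.
finalDisp = zeros(size(dispRL));
for s = 1:max(segs(:))
    inSeg = (segs == s) & valid;              % good pixels in this segment
    if any(inSeg(:))
        finalDisp(segs == s) = median(dispRL(inSeg));
    end
end
```

Using the median rather than the mean keeps a few wildly wrong pixels from dragging a whole segment off.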
I spent some time getting these simple ideas into working form. I’ve posted the codes and images I used as well as a demo script. To see how the code works, simply un-archive everything and run demo from a Matlab prompt. Enjoy, and let me know how these work for you.
Check out the newest results:
This stereo algorithm is just a tool to be used in other projects. For instance, computing the stereo disparity of a stereoscopic video makes it possible to improve tracking results using the 3D information. Segmentations can also be made more accurate if 3D information is known.
I wanted to put this up to introduce people to stereo vision (as this was my introductory project). Hopefully, the words and the codes above will save you some time getting up to speed. I would ask that if you find this helpful and make improvements, you let me know!