As mentioned, the problem is **very** hard and is often referred to as multi-view object reconstruction. It is usually approached by solving the stereo-view reconstruction problem for each pair of consecutive images.

Stereo reconstruction requires pairs of images with a good amount of overlap, i.e. many physical points that are visible in both images. You need to find corresponding points so that you can then use triangulation to find the 3D co-ordinates of those points.
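
To make the triangulation step a bit more concrete, here is a minimal sketch of linear (DLT) triangulation using NumPy. It assumes you already know the two 3x4 camera projection matrices and have one matched pixel pair. The function names and the camera numbers (800 px focal length, 0.1 unit baseline) are made up for illustration.

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two views.

    P1, P2: 3x4 camera projection matrices.
    x1, x2: (u, v) pixel co-ordinates of the same physical point in each image.
    """
    # Each view contributes two linear equations in the homogeneous point X.
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point to pixel co-ordinates with projection matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Hypothetical setup: identical cameras, the second one shifted 0.1 along +x.
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.1], [0.0], [0.0]])])

X_true = np.array([0.2, -0.1, 2.0])
X_est = triangulate_point(P1, P2, project(P1, X_true), project(P2, X_true))
```

With noise-free matches the estimate reproduces the original point. With real matches the two rays will not intersect exactly, and the SVD gives a least-squares compromise.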

## Epipolar geometry

Stereo reconstruction is usually done by first calibrating your camera setup so you can rectify your images using the theory of epipolar geometry. This simplifies finding corresponding points as well as the final triangulation calculations.

If you know the camera parameters, you can calculate the fundamental and essential matrices using only matrix theory and use these to rectify your images. This requires some theory about co-ordinate projections with homogeneous co-ordinates and also knowledge of the pinhole camera model and camera matrix.
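
As a rough sketch of the "only matrix theory" part: assuming you know the intrinsic matrices `K1` and `K2` and the relative pose `(R, t)` that maps camera-1 co-ordinates to camera-2 co-ordinates, the essential and fundamental matrices follow directly. The function names and all numbers below are hypothetical.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v equals np.cross(t, v)."""
    return np.array([
        [0.0, -t[2], t[1]],
        [t[2], 0.0, -t[0]],
        [-t[1], t[0], 0.0],
    ])

def essential_and_fundamental(K1, K2, R, t):
    """E and F for two cameras related by X2 = R @ X1 + t."""
    E = skew(t) @ R
    F = np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)
    return E, F

# Hypothetical calibration: same intrinsics, small rotation about the y axis.
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
a = np.deg2rad(5.0)
R = np.array([[np.cos(a), 0.0, np.sin(a)],
              [0.0, 1.0, 0.0],
              [-np.sin(a), 0.0, np.cos(a)]])
t = np.array([-0.1, 0.0, 0.02])
E, F = essential_and_fundamental(K, K, R, t)

# Check the epipolar constraint p2^T F p1 = 0 for one synthetic 3D point.
X1 = np.array([0.2, -0.1, 2.0])
X2 = R @ X1 + t
p1 = K @ X1 / X1[2]   # homogeneous pixel (u, v, 1) in image 1
p2 = K @ X2 / X2[2]   # homogeneous pixel (u, v, 1) in image 2
residual = p2 @ F @ p1
```

The residual should be zero up to floating-point error, which is a handy sanity check on whichever pose convention you end up using.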

If you want a method that doesn't need the camera parameters and works for unknown camera set-ups you should probably look into methods for uncalibrated stereo reconstruction.
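
One classical starting point for the uncalibrated case (my choice of example, there are others) is the normalised eight-point algorithm, which estimates the fundamental matrix from point correspondences alone. The sketch below assumes you already have at least eight matched pixel pairs; the synthetic cameras are only there to generate test data.

```python
import numpy as np

def normalise(pts):
    """Shift points to zero mean and scale so the mean distance from the origin is sqrt(2)."""
    mean = pts.mean(axis=0)
    scale = np.sqrt(2) / np.linalg.norm(pts - mean, axis=1).mean()
    T = np.array([[scale, 0.0, -scale * mean[0]],
                  [0.0, scale, -scale * mean[1]],
                  [0.0, 0.0, 1.0]])
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    return (T @ pts_h.T).T, T

def eight_point(pts1, pts2):
    """Estimate F with x2^T F x1 = 0 from N >= 8 correspondences (Nx2 arrays)."""
    x1, T1 = normalise(pts1)
    x2, T2 = normalise(pts2)
    # Each correspondence gives one linear equation in the 9 entries of F (row-major).
    A = np.column_stack([
        x2[:, 0] * x1[:, 0], x2[:, 0] * x1[:, 1], x2[:, 0],
        x2[:, 1] * x1[:, 0], x2[:, 1] * x1[:, 1], x2[:, 1],
        x1[:, 0], x1[:, 1], np.ones(len(x1)),
    ])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # A valid fundamental matrix has rank 2, so zero the smallest singular value.
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt
    # Undo the normalisation so F applies to the original pixel co-ordinates.
    return T2.T @ F @ T1

# Synthetic test data: 20 random 3D points seen by two hypothetical cameras.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(20, 3)) + np.array([0.0, 0.0, 6.0])
Xh = np.column_stack([X, np.ones(len(X))])
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
a = 0.1
R = np.array([[np.cos(a), 0.0, np.sin(a)],
              [0.0, 1.0, 0.0],
              [-np.sin(a), 0.0, np.cos(a)]])
t = np.array([[-0.5], [0.05], [0.1]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R, t])

def project_all(P):
    x = (P @ Xh.T).T
    return x[:, :2] / x[:, 2:3]

pts1, pts2 = project_all(P1), project_all(P2)
F = eight_point(pts1, pts2)
F /= np.linalg.norm(F)

# Epipolar residual |x2^T F x1| for every correspondence.
h1 = np.column_stack([pts1, np.ones(len(pts1))])
h2 = np.column_stack([pts2, np.ones(len(pts2))])
residuals = np.abs(np.sum(h2 * (h1 @ F.T), axis=1))
```

Note that without calibration the reconstruction you get from such an `F` is only defined up to a projective transformation, and with noisy real matches you would normally wrap this in a robust scheme such as RANSAC.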

## Correspondence problem

Finding corresponding points is the tricky part: you have to look for points of the same brightness or colour, or use texture patterns or some other features, to identify the same points in pairs of images. Techniques for this either work *locally* by looking for a best match in a small region around each point, or *globally* by considering the image as a whole.

If you already have the fundamental matrix, it will allow you to rectify the images such that corresponding points in the two images are constrained to a line (in theory), which turns the search for a match into a one-dimensional problem. This lets you use faster local techniques.
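
Here is a small sketch of that line constraint, assuming the convention `x2^T F x1 = 0`. For a point in image 1, `F @ x1` gives the line in image 2 on which its match has to lie. The helper names are hypothetical; the example matrix is the form `F` takes (up to scale) for an ideally rectified pair, where the line is simply the same image row.

```python
import numpy as np

def epipolar_line(F, x1):
    """Line (a, b, c) in image 2, scaled so a^2 + b^2 = 1, for pixel x1 = (u, v) in image 1."""
    line = F @ np.array([x1[0], x1[1], 1.0])
    return line / np.hypot(line[0], line[1])

def distance_to_line(line, x2):
    """Perpendicular pixel distance from x2 = (u, v) to the line a*u + b*v + c = 0."""
    return abs(line[0] * x2[0] + line[1] * x2[1] + line[2])

# Fundamental matrix of an ideal rectified pair (pure horizontal shift between cameras).
F_rectified = np.array([[0.0, 0.0, 0.0],
                        [0.0, 0.0, -1.0],
                        [0.0, 1.0, 0.0]])

line = epipolar_line(F_rectified, (100.0, 50.0))
on_row = distance_to_line(line, (300.0, 50.0))    # candidate on the same row
off_row = distance_to_line(line, (300.0, 53.0))   # candidate three rows lower
```

Candidates with a large distance to the line can be rejected without comparing any pixel values, which is where the speed-up comes from.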

There is still no ideal technique to solve the correspondence problem, but possible approaches fall into these categories:

- **Manual selection**: have a person hand-select matching points.
- **Custom markers**: place markers or use specific patterns/colours that you can easily identify.
- **Sum of squared differences**: take a region around a point and find the closest matching region in the other image.
- **Graph cuts**: a global optimisation technique based on graph theory.
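
To illustrate the sum of squared differences idea, here is a minimal sketch of local matching along one image row. It assumes the two images are already rectified greyscale arrays, so the match for a pixel in the left image lies on the same row in the right image, shifted left by some disparity. The function name, window size and search range are arbitrary choices.

```python
import numpy as np

def ssd_match_row(left, right, row, col, half=3, max_disp=32):
    """Disparity d that minimises the SSD between the window around
    left[row, col] and the window around right[row, col - d]."""
    patch = left[row - half:row + half + 1, col - half:col + half + 1].astype(np.float64)
    best_d, best_cost = 0, np.inf
    for d in range(max_disp + 1):
        c = col - d
        if c - half < 0:
            break  # window would fall off the left edge of the image
        cand = right[row - half:row + half + 1, c - half:c + half + 1].astype(np.float64)
        cost = np.sum((patch - cand) ** 2)
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d

# Synthetic pair: random texture, with the left image shifted 5 pixels
# relative to the right one (i.e. a constant disparity of 5).
rng = np.random.default_rng(0)
right = rng.random((40, 80))
left = np.roll(right, 5, axis=1)

disparity = ssd_match_row(left, right, row=20, col=40)
```

On real images this simple version struggles in textureless or repetitive regions and near occlusions, which is exactly where the global methods try to do better.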

For specific implementations you can use Google Scholar to search through the current literature. Here is one highly cited paper comparing various techniques:
A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms.

## Multi-view reconstruction

Once you have the corresponding points, you can then use epipolar geometry theory for the triangulation calculations to find the 3D co-ordinates of the points.

This whole stereo reconstruction would then be repeated for each pair of consecutive images (implying that you need an order to the images or at least knowledge of which images have many overlapping points). For each pair you would calculate a different fundamental matrix.

Of course, due to noise or inaccuracies at each of these steps you might want to consider how to solve the problem in a more global manner. For instance, if you have a series of images that are taken around an object and form a loop, this provides extra constraints that can be used to improve the accuracy of earlier steps using something like bundle adjustment.
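
Bundle adjustment boils down to minimising the reprojection error over all cameras and points at once. The sketch below only shows the residual function such an optimiser would work on, not the optimisation itself; the function name, data layout and camera numbers are my own assumptions.

```python
import numpy as np

def reprojection_residuals(cameras, points, observations):
    """Stack the pixel errors that bundle adjustment tries to minimise.

    cameras: list of 3x4 projection matrices.
    points: Nx3 array of estimated 3D points.
    observations: list of (camera_index, point_index, u, v) measurements.
    """
    residuals = []
    for cam_i, pt_i, u, v in observations:
        x = cameras[cam_i] @ np.append(points[pt_i], 1.0)
        residuals.extend([x[0] / x[2] - u, x[1] / x[2] - v])
    return np.array(residuals)

# Hypothetical two-camera setup observing a single point.
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
cameras = [
    K @ np.hstack([np.eye(3), np.zeros((3, 1))]),
    K @ np.hstack([np.eye(3), np.array([[-0.1], [0.0], [0.0]])]),
]
points = np.array([[0.2, -0.1, 2.0]])

# Build noise-free observations by projecting the point into each camera.
observations = []
for i, P in enumerate(cameras):
    x = P @ np.append(points[0], 1.0)
    observations.append((i, 0, x[0] / x[2], x[1] / x[2]))

cost_exact = np.sum(reprojection_residuals(cameras, points, observations) ** 2)
cost_perturbed = np.sum(reprojection_residuals(cameras, points + 0.01, observations) ** 2)
```

A non-linear least-squares solver (for example `scipy.optimize.least_squares`) can then adjust the camera parameters and 3D points together to drive these residuals down.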

As you can see, both stereo and multi-view reconstruction are far from solved problems and are still actively researched. The less you want to do in an automated manner the more well-defined the problem becomes, but even in these cases quite a bit of theory is required to get started.

## Alternatives

If it's within the constraints of what you want to do, I would recommend considering dedicated hardware sensors (such as the Xbox Kinect) instead of only using normal cameras. These sensors use structured light, time-of-flight or some other range imaging technique to generate a depth image, which they can also combine with colour data from their own cameras. They practically solve the single-view reconstruction problem for you and often include libraries and tools for stitching/combining multiple views.
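
Once you have a depth image, turning it into 3D points is just the pinhole model run backwards. Here is a minimal sketch, assuming depth in metres, zeros for missing readings, and known intrinsics; the intrinsic values below are placeholders, not the calibration of any particular sensor.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth image (metres, 0 = no reading) into an Nx3 array
    of points in the camera's co-ordinate frame."""
    v, u = np.indices(depth.shape)   # v = row index, u = column index
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.column_stack([x, y, z])

# Hypothetical 640x480 depth image of a flat wall 1.5 m away, one missing pixel.
depth = np.full((480, 640), 1.5)
depth[0, 0] = 0.0
cloud = depth_to_points(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
```

Point clouds from several viewpoints then still need to be registered to each other, which is what the stitching tools mentioned above take care of.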

## Epipolar geometry references

My knowledge is actually quite thin on most of the theory, so the best I can do is to further provide you with some references that are hopefully useful (in order of relevance):

I'm not sure how helpful all of this is, but hopefully it includes enough useful terminology and references to find further resources.

## Best Solution

What you're talking about is depth mapping, or 'disparity mapping', which is the basis of stereoscopic computer vision. The OpenCV project has libraries which do this. I don't know if they directly convert into a rotatable 3D object, which may be what you are looking for, but they probably come close.

http://opencv.willowgarage.com/wiki/

http://en.wikipedia.org/wiki/OpenCV

The dimension part is a little harder. The libraries can identify objects, but that would just give you 'common' dimension. If you were using stereoscopic imaging with two cameras, you could determine real depth, and therefore dimensions, from multiple samples.
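
For a rough idea of how two cameras give you real depth and dimensions, here is a small sketch. It assumes an ideal rectified pair of identical, parallel cameras with a known focal length (in pixels) and a known baseline; the function names and numbers are hypothetical.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth in metres of a point seen with the given disparity: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of both cameras")
    return focal_px * baseline_m / disparity_px

def size_from_pixels(extent_px, depth_m, focal_px):
    """Real-world extent in metres of something spanning extent_px pixels at the given depth."""
    return extent_px * depth_m / focal_px

# Hypothetical rig: 700 px focal length, cameras 10 cm apart.
depth = depth_from_disparity(focal_px=700.0, baseline_m=0.1, disparity_px=35.0)
width = size_from_pixels(extent_px=140.0, depth_m=depth, focal_px=700.0)
```

With these numbers the object is 2 m away and about 0.4 m wide. Because depth is inversely proportional to disparity, small disparity errors on distant objects turn into large depth errors.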

Without a stereo setup like that, the problem is really difficult (i.e., impossible).