The output is not automatically metrically scaled (you can fix this in postprocessing, but that's not part of this model). And you can't really move around much without getting glitches, because it only infers along one axis. It's also hard capped at 768 pixels and 2 layers.
Besides, depth/splatting models had been around for quite a while before this. The main thing this model innovates on is inference speed, but VR porn isn't a use case that really benefits from faster image/video processing, especially since it's still not real time.
This year has seen a lot of innovation in this space, but it's coming from other image editing and video models.
It's not for moving around, but for turning some image into a stereoscopic one (or 2 side-by-side images if you will). Lots of techniques for this exist, which usually turn an image into depth information using AI and then use any number of approaches to generate/warp 2 offset images from it.
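For anyone unfamiliar with the depth-then-warp idea, here is a minimal sketch of one of the simplest variants: forward-warping each pixel horizontally by a disparity derived from its depth. This is not what ml-sharp or any particular tool does; the function names, the `max_disparity` parameter, and the inverse-depth normalization are all my own assumptions for illustration. It leaves disocclusion holes unfilled, which is exactly the part that mesh warping or inpainting would have to deal with.

```python
import numpy as np

def depth_to_disparity(depth, max_disparity=8.0):
    """Map a depth map (larger = farther) to per-pixel disparity in pixels.

    Assumption: disparity is linear in normalized inverse depth, so the
    nearest pixel gets max_disparity and the farthest gets 0.
    """
    inv = 1.0 / np.maximum(depth, 1e-6)
    span = inv.max() - inv.min()
    if span == 0:
        return np.zeros_like(inv)
    return (inv - inv.min()) / span * max_disparity

def forward_warp(image, disparity, sign):
    """Shift every pixel horizontally by sign * disparity / 2.

    Where several source pixels land on the same target, the one with the
    larger disparity (the nearer one) wins. Returns the warped image and a
    boolean mask of the pixels that received a value; the rest are holes.
    """
    h, w = disparity.shape
    out = np.zeros_like(image)
    best = np.full((h, w), -np.inf)
    filled = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            d = disparity[y, x]
            tx = int(round(x + sign * d / 2))
            if 0 <= tx < w and d > best[y, tx]:
                out[y, tx] = image[y, x]
                best[y, tx] = d
                filled[y, tx] = True
    return out, filled

def stereo_pair_from_depth(image, depth, max_disparity=8.0):
    """Return ((left, left_mask), (right, right_mask)) for an (H, W, C) image."""
    disparity = depth_to_disparity(depth, max_disparity)
    # Near objects sit further right in the left-eye view and further left
    # in the right-eye view.
    left = forward_warp(image, disparity, +1)
    right = forward_warp(image, disparity, -1)
    return left, right
```

The pure-Python loops are slow and only there to keep the occlusion handling obvious. The `False` entries in the returned masks are the holes that the various approaches differ on how to fill (or avoid).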
So far the best looking results are still achieved with good old mesh warping and no inpainting at all. This may change that.
Ah, but if we're not talking 6DOF, what's new with ml-sharp? We've had good autostereoscopy for a couple of years at least.
> So far the best looking results are still achieved with good old mesh warping and no inpainting at all.
I agree
> This may change that.
Seems not to be the case in my testing. The splats are too fine and sparse to yield an improvement. There are actually better (slower) image -> splat models than ml-sharp (with much higher dynamic range for the covariance) but I still don't use them over meshes for this.
The only improvements ml-sharp seems to add to the SOTA are 1) speed and 2) an interesting 2-focal layer architecture, but these are somewhat tangential steps.
I'm not kidding. That's going to be >80% of the images/videos synthesized with this.