
I feel like this misses the point. Vision and image generation are also entirely different things. Even in humans the two are separate: some people can't form images in their head at all, despite having perfectly good vision.

Understanding optics rather than intelligence describes the traditional render workflow: a pure simulation of input data with no "creative process". That's either the massive hack that is the traditional game render pipeline, or proper light simulation. We'll probably eventually reach full-scene, real-time ray tracing.

The AI image generation approach is the "intelligence" approach, where you throw all optics, physics, and rendering knowledge up in the air and let the model "paint" the scene as it imagines it, like handing a pencil to a cartoon/anime artist. Zero simulation, zero physics, zero rules - just the imagination of a black box.

No light transport, physics, or existing render pipeline tricks are relevant. If that's what you want, you're looking for entirely new tricks: tricks for object permanence, attention to detail (no variable finger counts), and inference performance. And even once it runs in real time, giving up control and your own definition of consistency is part of the deal when you hand the artist's role to the box.

If you want AI in the simulation approach, you'll be taking an entirely different path: skipping any involvement in rendering/image creation and instead just letting the model puppeteer the scene within some physics constraints. Makes for cool games, but it's completely unrelated to the technology being discussed.




