"Draw a picture of a full glass of wine, ie a wine glass which is full to the brim with red wine and almost at the point of spilling over... Zoom out to show the full wine glass, and add a caption to the top which says "HELL YEAH". Keep the wine level of the glass exactly the same."
Maybe the "HELL YEAH" added a "party implication" which shifted its "thinking" into a just-correct-enough latent space that it was able to actually hunt down some image somewhere in its training data of a truly full glass of wine.
I almost wonder if prompting it "similar to a full glass of beer" would get it shifted just enough.
Yeah. I understand that this site doesn’t want to become Reddit, but it really has an allergy to comedy, it’s sad. God forbid you use sarcasm, half the people here won’t understand it and the other half will say it’s not appropriate for healthy discussion…
Is it drawing the image from top to bottom very slowly over the course of at least 30 seconds? If not, then you're using DALL-E, not 4o image generation.
This top-to-bottom drawing – does this tell us anything about the underlying model architecture? AFAIK diffusion models do not work like that; they denoise the full frame over many steps. In the past there were attempts to slowly synthesize a picture by predicting the next pixel, but I wasn't aware of any shift to that kind of architecture at OpenAI.
Yes, the model card explicitly says it's autoregressive, not diffusion. And it's not a separate model, it's a native ability of GPT-4o, which is a multimodal model. They just didn't make this ability public until now. I assume they worked on the fine-tuning to improve prompt following.
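To make the distinction concrete, here's a toy sketch of the two generation loops. This is not OpenAI's code; `sample_next_token` and `denoise_step` are hypothetical stand-ins, and the point is only the shape of the loops: autoregressive generation has valid partial output (hence the visible top-to-bottom reveal), while diffusion refines the whole frame at once.

```python
def autoregressive_generate(sample_next_token, n_tokens):
    """Emit image tokens one at a time, left-to-right, top-to-bottom.

    Partial results are meaningful: the first rows of the image exist
    before the last rows are generated, which is what you'd stream to
    the user as a slow top-to-bottom reveal.
    """
    tokens = []
    for _ in range(n_tokens):
        # each new token is conditioned on everything emitted so far
        tokens.append(sample_next_token(tokens))
    return tokens


def diffusion_generate(denoise_step, noise, n_steps):
    """Refine the *whole* frame over many denoising steps.

    There is no row-by-row intermediate state; every step touches every
    pixel, so a preview would look like a blurry full image sharpening.
    """
    x = noise
    for t in reversed(range(n_steps)):
        x = denoise_step(x, t)
    return x
```

With trivial stand-in functions you can see the difference: the autoregressive loop builds a sequence incrementally, the diffusion loop repeatedly transforms one complete frame.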
It very much looks like a side effect of this new architecture. In my experience, text looks much better in recent DALL-E images (so what ChatGPT was using before), but it is still noticeably mangled when printing more than a few letters. This model update seems to improve text rendering by a lot, at least as long as the content is clearly specified.
Yeah who wouldn't love a dip in the sulphur pool. But back to the question, why can't such a model recognize letters as such? It cannot be trained to pay special attention to characters? How come it can print an anatomically correct eye but not differentiate between P and Z?
I think we're really fscked, because even AI image detectors think the images are genuine. They look great in Photoshop forensics too. I hope the arms race between generators and detectors doesn't stop here.
We're not. This PNG image of a wine glass has JPEG compression artefacts leaking in from JPEG training data. Zoom into the image and you will see the 8x8 block boundaries used in JPEG compression, which simply cannot occur in a clean PNG. This is a common method of detecting AI-generated images, and it has worked so far: no need for complex Photoshop forensics or AI detectors, just zoom in and check for compression artefacts. Current AI is incapable of getting this right – all the compression algorithms are mixed and mashed together in the training data, so on a generated image you can find artefacts from almost all of them if you're lucky, though JPEG is obviously prevalent, since lossless images are rare online.
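The zoom-in check can be automated with a standard blockiness measure: compare pixel differences that fall on 8-pixel grid boundaries against differences everywhere else. This is a minimal NumPy sketch of that idea (the function name and the threshold interpretation are mine, not from any standard tool); a ratio well above 1.0 suggests the image once passed through JPEG's 8x8 block transform, even if it's now saved as PNG.

```python
import numpy as np

def jpeg_blockiness(gray: np.ndarray) -> float:
    """Ratio of mean pixel differences at 8-pixel grid boundaries
    vs. interior differences, for a 2-D grayscale array.

    ~1.0 means no visible 8x8 grid; noticeably larger values mean
    discontinuities line up with JPEG's block boundaries.
    """
    gray = gray.astype(np.float64)
    col_diff = np.abs(np.diff(gray, axis=1))  # shape (H, W-1)
    row_diff = np.abs(np.diff(gray, axis=0))  # shape (H-1, W)

    # diff index i is the boundary between pixels i and i+1, so JPEG
    # block edges (between pixel 7 and 8, 15 and 16, ...) land at
    # indices 7, 15, 23, ...
    col_b = (np.arange(col_diff.shape[1]) % 8) == 7
    row_b = (np.arange(row_diff.shape[0]) % 8) == 7

    boundary = col_diff[:, col_b].mean() + row_diff[row_b, :].mean()
    interior = col_diff[:, ~col_b].mean() + row_diff[~row_b, :].mean()
    return boundary / interior
```

On a real image you'd load it with Pillow, convert to grayscale, and pass the array in; a synthetic test with constant 8x8 blocks plus noise scores far above a smooth random image.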
If JPEG compression is the only evident flaw, this kind of reinforces my point, as most of these images will end up shared as processed JPEG/WebP on social media.