Hacker News

https://i.imgur.com/xsFKqsI.png

"Draw a picture of a full glass of wine, ie a wine glass which is full to the brim with red wine and almost at the point of spilling over... Zoom out to show the full wine glass, and add a caption to the top which says "HELL YEAH". Keep the wine level of the glass exactly the same."



Maybe the "HELL YEAH" added a "party implication" which shifted its "thinking" into a just-correct-enough latent space that it was able to actually hunt down some image somewhere in its training data of a truly full glass of wine.

I almost wonder if prompting it "similar to a full glass of beer" would get it shifted just enough.


Can't replicate. Maybe the rollout is staggered? Using Plus from Europe, it's consistently giving me a half full glass.


I am using Plus from Australia, and while I am not getting a full glass, nor am I getting a half full glass. The glass I'm getting is half empty.


Surprised it isn't fully empty for being upside down!


That's funny. HN hates funny. Enjoy your shadowban.


Yeah. I understand that this site doesn’t want to become Reddit, but it really has an allergy to comedy, and it’s sad. God forbid you use sarcasm: half the people here won’t understand it, and the other half will say it’s not appropriate for healthy discussion…


Good example in this very discussion: https://news.ycombinator.com/item?id=43477003


I like this site, but it can become inhuman sometimes.

People get upvoted for pedantry rather than furthering a conversation, e.g.


Is it drawing the image from top to bottom very slowly over the course of at least 30 seconds? If not, then you're using DALL-E, not 4o image generation.


This top-to-bottom drawing – does it tell us anything about the underlying model architecture? AFAIK diffusion models do not work like that; they denoise the full frame over many steps. In the past there were attempts to slowly synthesize a picture by predicting the next pixel, but I wasn't aware of a shift to that kind of architecture at OpenAI.


Yes, the model card explicitly says it's autoregressive, not diffusion. And it's not a separate model; it's a native ability of GPT-4o, which is a multimodal model. They just didn't make this ability public until now. I assume they worked on fine-tuning to improve prompt following.
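The structural difference between the two approaches explains the top-to-bottom rendering. Here's a toy sketch contrasting the two generation loops; `denoise_step` and `next_token` are hypothetical stand-ins for learned networks, not real model code:

```python
def diffusion_generate(noise, denoise_step, steps=50):
    """Diffusion: the WHOLE image is refined at every step,
    so there is never a partially drawn region."""
    img = noise
    for t in range(steps, 0, -1):
        img = denoise_step(img, t)  # full frame updated each iteration
    return img

def autoregressive_generate(next_token, n_tokens):
    """Autoregressive: image tokens are emitted one at a time, in
    raster order, which is why the picture can appear top-to-bottom."""
    tokens = []
    for _ in range(n_tokens):
        tokens.append(next_token(tokens))  # conditioned on the prefix
    return tokens
```

With stubs plugged in (e.g. a `next_token` that just looks at the prefix length), the autoregressive loop produces its output strictly left-to-right, while the diffusion loop touches every pixel on every iteration.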


apparently it's not diffusion, but tokens


Works for me as well https://chatgpt.com/share/67e3f838-63fc-8000-ab94-5d10626397...

USA, but VPN set to exit in Canada at time of request (I think).


The EU got the drunken version. And a good drunk knows never to top off a glass of wine. In that context the glass is already "full".

But aside from that, it would only be comparable if we compared your prompts.


Maybe it's half empty.


ha


You might still be on DALL-E. My account still is if I use ChatGPT.

I switched over to the sora.com domain and now I have access to it.


The free site even has it. Just don't turn on image generation; it works with it off. If you enable it, it uses DALL-E.


The most interesting thing to me is that the spelling is correct.

I'm not a heavy user of AI or image generation in general, so is this also part of the new release or has this been fixed silently since last I tried?


It very much looks like a side effect of this new architecture. In my experience, text looks much better in recent DALL-E images (so what ChatGPT was using before), but it is still noticeably mangled when printing more than a few letters. This model update seems to improve text rendering by a lot, at least as long as the content is clearly specified.

However, when giving a prompt that requires the model to come up with the text itself, it still seems to struggle a bit, as can be seen in this hilarious example from the post: https://images.ctfassets.net/kftzwdyauwt9/21nVyfD2KFeriJXUNL...


The periodic table is absolutely hilarious, I didn't know LLMs had finally mastered absurdist humor.


Yeah who wouldn't love a dip in the sulphur pool. But back to the question, why can't such a model recognize letters as such? It cannot be trained to pay special attention to characters? How come it can print an anatomically correct eye but not differentiate between P and Z?


I think the model has not decided if it should print a P or a Z, so you end up with something halfway between the two.

It's a side effect of the entire model being differentiable - there is always some halfway point.
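That "halfway point" can be illustrated with a toy numpy example. The 4×4 glyph templates below are made up purely for illustration; the point is that a differentiable model's output is a weighted average, so an undecided 50/50 mix of two glyphs renders as a blend that is neither letter:

```python
import numpy as np

# Made-up 4x4 binary glyph templates for "P" and "Z".
P = np.array([[1, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 1, 0],
              [1, 0, 0, 0]], dtype=float)

Z = np.array([[1, 1, 1, 1],
              [0, 0, 1, 0],
              [0, 1, 0, 0],
              [1, 1, 1, 1]], dtype=float)

# An "undecided" output: a differentiable blend of both targets.
# Pixels where the glyphs disagree come out at 0.5 -- gray smudges
# that read as neither a crisp P nor a crisp Z.
undecided = 0.5 * P + 0.5 * Z
```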


The head of foam on that glass of wine is perfect!


I think we're really fscked, because even AI image detectors think the images are genuine. They look great in Photoshop forensics too. I hope the arms race between generators and detectors doesn't stop here.


We're not. This PNG image of a wine glass has JPEG compression artefacts leaking in from JPEG training data. Zoom into the image and you will see the 8x8 boundaries of the blocks used in JPEG compression, which just cannot be in a PNG. This is a common method of detecting AI-generated images, and so far it works: no need for complex Photoshop forensics or AI detectors, just zoom in and check for compression artefacts. Current AI is incapable of getting this right. All the compression algorithms are mixed and mashed together in the training data, so in a generated image you can find artefacts from almost all of them if you're lucky, but JPEG is obviously the most prevalent, since lossless images are rare online.
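That zoom-in check can be roughed out numerically: compare pixel discontinuities at the 8-pixel grid boundaries against discontinuities everywhere else. This is a sketch under assumptions (grayscale image as a numpy array; `block_boundary_score` is a name invented for this example), not a production forensic tool:

```python
import numpy as np

def block_boundary_score(gray: np.ndarray, block: int = 8) -> float:
    """Ratio of mean pixel discontinuity at block-grid column
    boundaries vs. elsewhere. A score well above 1.0 hints that the
    image once went through block-based (JPEG-style) compression."""
    # Absolute differences between neighbouring columns.
    diff = np.abs(np.diff(gray.astype(float), axis=1))
    cols = np.arange(diff.shape[1])
    at_boundary = (cols % block) == (block - 1)  # seams 7|8, 15|16, ...
    boundary_mean = diff[:, at_boundary].mean()
    elsewhere_mean = diff[:, ~at_boundary].mean()
    return boundary_mean / (elsewhere_mean + 1e-9)
```

On a synthetic image built from constant 8x8 blocks the score blows up (all the discontinuity sits on the seams), while a smooth gradient scores about 1.0. A real detector would also look at row seams, DCT statistics, and chroma subsampling, but the idea is the same.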


If JPEG compression is the only evident flaw, this kind of reinforces my point, as most of these images will end up shared as processed JPEG/WebP on social media.


You didn't get it. The image contains compression artifacts from ALL the different algorithms mashed up in a single picture; JPEG is just the most prevalent.


Oh, I see. There's still room for reliable detection then.


Plenty of real PNG images have JPEG artifacts because they were once JPEGs off someone's phone...



