Flux 1.1 Pro has good prompt adherence, but some of these (admittedly cherry-picked) GPT-4o image-generation demos are beyond what you'd get from Flux without a lot of iteration, particularly the large paragraphs of rendered text.
I'm excited to see what a Flux 2 could do if it actually used a modern text encoder.
Structural editing and control nets are much more powerful than text prompting alone.
The image generators used by creatives will not be text-first.
"Dragon with brown leathery scales with an elephant texture and 10% reflectivity positioned three degrees under the mountain, which is approximately 250 meters taller than the next peak, ..." is not how you design.
Creative work is not 100% dice rolling in a crude and inadequate language. Precisely encoding spatial and qualitative details in text is impossible. "A picture is worth a thousand words" is an understatement.
It can do in-context learning from images you upload. So you can just upload a depth map, or mark up an image with the locations of the edits you want, and it should be able to handle that. My point is that since it's the same model that understands both how to see images and how to generate them, you aren't restricted to interacting with it via text alone.
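As a rough sketch of what that workflow could look like (the model name, endpoint behavior, and file names here are assumptions based on the OpenAI-style image-edit API, not a documented 4o interface):

```python
# Hypothetical sketch: feeding a marked-up image back to the model as edit guidance.
# Model name and response handling are assumptions, not a documented 4o workflow.
import base64
from openai import OpenAI

client = OpenAI()

result = client.images.edit(
    model="gpt-image-1",  # assumed image-capable model name
    image=open("scene_marked_up.png", "rb"),  # image annotated with the desired edits
    prompt="Apply the edits circled in red: swap the sky to dusk, extend the bridge.",
)

# The endpoint returns the edited image as base64; decode and save it.
with open("scene_edited.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```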
Prompt adherence and additional tricks such as ControlNet/ComfyUI pipelines are not mutually exclusive. Both are very important to get good image generation results.
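For instance, here's a minimal depth-conditioned pipeline using diffusers' ControlNet support; the model IDs and the depth-map file are illustrative, and a ComfyUI graph would express the same idea as nodes:

```python
# Minimal sketch of depth-conditioned generation with a ControlNet.
# Model IDs and the depth map path are illustrative assumptions.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

depth_map = load_image("dragon_depth.png")  # your own depth map fixes the composition

image = pipe(
    "a dragon with brown leathery scales perched below a mountain peak",
    image=depth_map,
    num_inference_steps=30,
).images[0]
image.save("dragon.png")
```

The depth map pins down the spatial layout that would take paragraphs to describe in text, which is exactly the point made above.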
It is when it's kept behind an API. You cannot use ControlNet or ComfyUI, and especially not the best techniques like regional prompting, with this model. You can't do it with Gemini either, and that's by design, because otherwise coomers are going to generate 999999 anime waifus like they do on Civit.ai.
> That's a fun idea—but generating an image with 999,999 anime waifus in it isn't technically possible due to visual and processing limits. But we can get creative.
>
> Want me to generate:
>
> 1. A massive crowd of anime waifus (like a big collage or crowd scene)?
> 2. A stylized representation of “999999 anime waifus” (maybe with a few in focus and the rest as silhouettes or a sea of colors)?
> 3. A single waifu with a visual reference to the number 999999 (like a title, emblem, or digital counter in the background)?
>
> Let me know your vibe—epic, funny, serious, chaotic?
> Yeah, but then it no longer replaces human artists.
Automation tools are always more powerful as a force multiplier for skilled users than a complete replacement. (Which is still a replacement on any given task scope, since it reduces the number of human labor hours — and, given any elapsed time constraints, human laborers — needed.)
We're not trying to replace human artists. We're trying to make them more efficient.
We might find that the entire "studio system" is a gross inefficiency and that individual artists and directors can self-publish like on Steam or YouTube.
Exactly. OpenAI isn't going to win image and video.
Sora is one of the worst video generators right now. Chinese labs have really taken the lead in video with Kling, Hailuo, and the open-source Wan and Hunyuan.
Wan with LoRAs will enable real creative work: motion control, character consistency. There's no place for an OpenAI Sora-type product other than as a cheap LLM add-in.
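A hedged sketch of what that looks like in practice, assuming diffusers' Wan pipeline; the LoRA repository name is hypothetical:

```python
# Sketch of layering a LoRA on a Wan text-to-video base model.
# The LoRA repo name is hypothetical; base model ID assumes diffusers' Wan 2.1 port.
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

# Add a motion/character LoRA on top of the base weights (hypothetical repo).
pipe.load_lora_weights("your-org/wan-character-lora")

frames = pipe(
    prompt="the same character walking through a rainy market, consistent design",
    num_frames=81,
).frames[0]
export_to_video(frames, "clip.mp4", fps=16)
```

Swapping LoRAs per shot is what gives you the character consistency and motion control that a closed, prompt-only API can't offer.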