It's hard to compare the two tools because they change so much and so fast.
Right now, as an example, Claude Code with Opus 4.5 is a beast, but before that, with Sonnet 4.0, Codex was much better.
Gemini CLI, on the other hand, with gemini-flash-3.0 (which is strangely good for a "small and fast" model), is very good, but the CLI and the user experience are not on par with Codex or Claude Code yet.
So we need to keep those tools under constant observation. Currently (since gemini-flash-3.0 came out), I tend to submit the same task to Claude (with Opus) and Gemini to compare their behaviour, and Gemini is surprising me.
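In practice that just means firing the same one-shot prompt at both CLIs and reading the replies side by side. A minimal sketch of that loop, assuming both `claude` and `gemini` accept a non-interactive prompt via `-p` (check the flags on your installed versions) and with a placeholder task string:

```python
import subprocess

# Hypothetical harness: send the same one-shot task to both CLIs and
# dump each reply for side-by-side reading. Assumes `claude -p` and
# `gemini -p` run a single non-interactive prompt; the task string is
# just a placeholder.
TASK = "Explain the failure mode in this stack trace and propose a fix."

for cli in ("claude", "gemini"):
    result = subprocess.run(
        [cli, "-p", TASK],
        capture_output=True,
        text=True,
    )
    print(f"--- {cli} ---")
    print(result.stdout or result.stderr)
```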
Heya, author here! I completely agree with you, which is why the post is titled Codex vs. Claude Code (Today). I also have this very specific disclaimer in the second paragraph to note that this post is a reflection of a moment in time. :D
> Before we continue, I need to make a disclaimer: This post is about Claude Code and Codex, on December 22, 2025. Everything in AI changes so fast that I have almost no expectations about the validity of these statements in a year, or probably even 3-6 months from now.
That said, I do what you do and try different models when I want to see if things have changed. I run my own private little benchmarks with a few complex real-world tasks, and I really love seeing how things are progressing, both in terms of quality and the novel quirks that are introduced, changed, or removed. :)