I wonder whether people who say LLMs are like a smart junior programmer have ever used LLMs for coding, or actually worked with a junior programmer, because to me the two are not even remotely comparable.
If I ask Claude to do a basic operation on all files in my codebase, it won't do it. Halfway through it will get distracted and do something else, or simply change the operation. No junior programmer would ever do this. The same goes for the other examples in the blog.
Right, that is their main limitation currently: they can't consider the full system context while operating on a specific feature. But you must work with excellent juniors (or I work with very poor ones), because getting them to think about changes in the context of the bigger picture is a challenge too.
This is definitely a huge factor in the mistakes I see. If I hand an LLM other parts of the codebase along with my request so that it has more context, it makes fewer mistakes.
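Something like this is what I mean, just as a rough sketch (the file names and the task are made up, and the resulting prompt goes to whatever chat interface or API you use):

    from pathlib import Path

    # Hypothetical file list: include everything the change actually touches,
    # not just the one file you happen to be editing.
    relevant_files = ["models/user.py", "services/auth.py", "tests/test_auth.py"]

    # Concatenate the files with headers so the model can tell them apart.
    context = "\n\n".join(
        f"# File: {path}\n{Path(path).read_text()}" for path in relevant_files
    )

    prompt = (
        "Here are the parts of the codebase this change touches:\n\n"
        f"{context}\n\n"
        "Task: rename the `login` method to `authenticate` and update every "
        "caller and the tests accordingly."
    )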
These problems are getting solved as context lengths grow and as the tooling gets better at sending the LLM all the information it needs.
To be fair, the guys I get are pretty good and actually learn. The model doesn't. I have to have the same arguments with it over and over again, and I have to remember which arguments I had last time. Then when they update the model, it comes up with new stupid things I have to argue with it about.
Net loss for me. I have no idea how people are finding these things productive unless they really don't know or care what garbage comes out.
> the guys I get are pretty good and actually learn. The model doesn't.
Core issue. LLMs never ever leave their base level unless you actively modify the prompt. I suppose you _could_ use finetuning to whip it into a useful shape, but that's a lot of work. (https://arxiv.org/pdf/2308.09895 is a good read)
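In practice, "modify the prompt" can be as mundane as keeping the corrections you've already had to make in a standing preamble and prepending it to every request. A minimal sketch, with a made-up file name:

    from pathlib import Path

    # Hypothetical conventions file: one bullet per argument you've already won,
    # prepended to every request so it doesn't get re-litigated next session.
    CONVENTIONS_FILE = Path("llm_conventions.md")

    def build_prompt(task: str) -> str:
        conventions = (
            CONVENTIONS_FILE.read_text() if CONVENTIONS_FILE.exists() else ""
        )
        return (
            f"Project conventions (do not violate these):\n{conventions}\n"
            f"Task: {task}"
        )

    def record_convention(rule: str) -> None:
        # Append the outcome of each correction so it sticks across sessions.
        with CONVENTIONS_FILE.open("a") as f:
            f.write(f"- {rule}\n")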
But the flip side of that core issue is that if the base level is high, they're good. Which means for Python & JS, they're pretty darn good. Making pandas garbage work? Just the task for an LLM.
But yeah, R & nginx is not a major part of their original training data, and so they're stuck at "no clue, whatever stackoverflow on similar keywords said".
Perhaps swearing at the LLM actually produces worse results?
Not sure if you’re being figurative, but if what you wrote in your first comment is indicative of the tone you use when prompting the LLM, then I’m not surprised you get terrible results. Swearing at the model doesn’t help it produce better code. The model isn’t going to be intimidated by you or worried about losing its job, which I’d bet your junior engineers are.
Ultimately, prompting LLMs well is simply a matter of writing well. Some people seem to write prompts like flippant Slack messages, expecting the LLM to somehow have a dialogue with them to clarify their poorly framed, half-assed requirements. That’s just not how they work. Specify what you actually want and they can execute on it. Why do you expect the LLM to read your mind and know the shape of nginx logs vs nginx-ingress logs? Why not provide an example in the prompt?
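For example, something as simple as this (the log line is made up but follows the standard combined access-log format; paste a real line from your own logs):

    # Show the model the exact shape of the data instead of hoping it guesses.
    sample_line = (
        '203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] '
        '"GET /healthz HTTP/1.1" 200 612 "-" "kube-probe/1.27"'
    )

    prompt = f"""Write a Python function that parses nginx access log lines like
    the one below and returns the client IP, timestamp, request path, and
    status code as a dict:

    {sample_line}
    """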
It’s odd—I go out of my way to “treat” the LLMs with respect, and find myself feeling an emotional reaction when others write to them with lots of negativity. Not sure what to make of that.
But at the same time it'll write me 2,000 lines of really gnarly, heavily optimized text-parsing code that would have taken a senior dev all day to crank out.
We have to stop trying to compare them to humans, because they are alien. They make mistakes no human would, and they complete tasks that would be tedious and difficult for a human, all in the same output.
I'm net-positive from using AI, though. It can definitely remove a lot of tedium.