To be clear, I'm not saying that LLMs exclusively make non-human errors. I'm saying that most of their errors happen for different "reasons" than human errors do.
Think about the strawberry example. I've seen a lot of articles lately showing that not every misspelling of the word "strawberry" reliably produces letter-counting errors. Being thrown off by a misspelling is human enough in general, but the specific pattern of which misspellings cause trouble is really more unique to LLMs (i.e., different spelling errors would trip up humans versus LLMs).
The part that makes it challenging is that we don't know what these "triggers" are. You could have a prompt with 95% accuracy that inexplicably drops to 50% if the word "green" is in the question (or something like that).
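
If you already suspect a particular trigger, one rough way to check is to split your eval results by whether the word shows up in the question and compare accuracy on each side. Here's a minimal sketch of that idea. The function name `accuracy_by_trigger` and the sample data are made up for illustration, and it assumes you already have a list of (question, was_correct) pairs from your own runs.

```python
import re

def accuracy_by_trigger(results, trigger):
    """Split (question, was_correct) pairs by whether `trigger` appears
    as a whole word, and return the accuracy of each group."""
    pattern = re.compile(rf"\b{re.escape(trigger)}\b", re.IGNORECASE)
    buckets = {True: [], False: []}
    for question, was_correct in results:
        buckets[bool(pattern.search(question))].append(was_correct)
    # None means no examples fell into that group
    return {
        ("with" if has_trigger else "without"): (sum(v) / len(v) if v else None)
        for has_trigger, v in buckets.items()
    }

# Hypothetical results, invented purely for illustration
results = [
    ("What color is a green apple?", False),
    ("Is the green light on?", True),
    ("How many legs does a spider have?", True),
    ("What is 2 + 2?", True),
]
print(accuracy_by_trigger(results, "green"))
```

Of course this only helps once you have a candidate word in mind, and you'd need far more than a handful of examples before a gap means anything. Finding the triggers you don't know about is the hard part.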