Hacker News

The good old "benchmarks just keep saturating" problem.

Anthropic is genuinely one of the top companies in the field, and for good reason. Opus consistently punches above its weight, and that's only partly due to the lack of OpenAI's atrocious personality tuning.

Yes, the next stop for AI is increasing the task-length horizon and improving agentic behavior. The "raw general intelligence" component in bleeding-edge LLMs is clearly outpacing the "executive function".





Shouldn't the next stop be to improve general accuracy, which is what these tools have struggled with since their inception? How long are "AI" companies going to offload onto the user the responsibility of verifying their tools' output?

Optimizing for benchmark scores, which are heavily gamed to begin with, by throwing more resources at the problem is exceedingly tiring. Surely they must have noticed the performance plateau and the diminishing returns of this approach by now, yet every new announcement is more of the same.


What "performance plateau"? The "plateau" disappears the moment you get harder unsaturated benchmarks.

It's getting more and more challenging to do that, just not because the models aren't improving. Quite the opposite.

Framing "improve general accuracy" as something no one is working on is really weird, too.

You need "general accuracy" for agentic behavior to work at all. If you have a simple ten step plan, and each step has a 50% chance of an unrecoverable failure, then your plan is fucked, full stop. To advance on those benchmarks, the LLM has to fail less and recover better.
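The arithmetic behind that point is worth spelling out. A toy calculation, assuming the commenter's premises (independent steps, any failure is unrecoverable); the function name is mine:

```python
def plan_success_probability(steps: int, per_step_failure: float) -> float:
    """Probability that every step succeeds when any single failure aborts the plan."""
    # Each step independently succeeds with probability (1 - per_step_failure),
    # so the whole plan survives only if all `steps` succeed in a row.
    return (1.0 - per_step_failure) ** steps


# The ten-step plan with a 50% per-step failure rate from the comment above:
print(plan_success_probability(10, 0.50))  # 0.5**10 ~= 0.001, i.e. ~0.1% success

# Even a seemingly strong 95% per-step accuracy fails roughly 40% of the time:
print(plan_success_probability(10, 0.05))  # 0.95**10 ~= 0.60
```

This is why small per-step accuracy gains compound into large agentic gains: the plan-level success rate is exponential in the number of steps.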

Hallucination is a "solvable but very hard to solve" problem. Considerable progress is being made on it, but if there's "this one weird trick" that deletes hallucinations, we sure haven't found it yet. Humans get a body of meta-knowledge for free, which lets them dodge hallucinations decently well (not perfectly) if they want to. LLMs get pathetic crumbs of meta-knowledge and have little skill in using it. There's room for improvement, but it's not trivial to improve.



