This paper introduces a new benchmark composed of real remote-work tasks sourced from the freelancing platform Upwork. Leading commercial LLMs, including Opus, GPT, Gemini, and Grok, were tested.
Models released a few days ago, Opus 4.6 and GPT 5.3, haven't been tested yet, but given their performance on other micro-benchmarks, they probably won't differ much on this one.
One of the tasks was "Build an interactive dashboard for exploring data from the World Happiness Report." -- I can't imagine how Opus 4.5 could've failed that.
Kinda sus that the least-known model did best and none of the more recent models were tested. Capabilities grow very fast, so things that routinely succeed now rarely succeeded even half a year ago.
That's wildly overestimating what monkeys can do on a typewriter.
It takes a lot just to be mediocre. Don't get me wrong, I'll agree that current ML is mediocre; it's just that "mediocre" is an incomprehensibly huge step up from "random".