This paper introduces a new benchmark composed of real remote-work tasks sourced from the freelancing platform Upwork. Leading commercial LLMs, including Opus, GPT, Gemini, and Grok, were tested.
Models released a few days ago, Opus 4.6 and GPT 5.3, haven't been tested yet, but given their performance on other micro-benchmarks, they probably won't differ much on this one.
One of the tasks was "Build an interactive dashboard for exploring data from the World Happiness Report." -- I can't imagine how Opus 4.5 could've failed that.
Kinda sus that the least-known model did best and none of the more recent models were tested. Capabilities grow very fast, so things that routinely succeed now rarely succeeded even half a year ago.
That's wildly overestimating what monkeys can do on a typewriter.
It takes a lot just to be mediocre. Don't get me wrong, I'll agree that current ML is mediocre; it's just that "mediocre" is an incomprehensibly huge step up from "random".