This is an incredibly silly comparison. It amounts to claiming that a Ford Pinto...

vohk · 2026-05-30T16:44:05 1780159445

I think your analogy makes the opposite case better. A Rolls-Royce and a Pinto have the same real commute time because horsepower isn't the bottleneck, and they both get passengers from point to point. Sure, the Pinto explodes a bit, but much like the actuaries at Ford, you might well judge the cost of an occasional explosion to be a trade-off you can easily compensate for.

I would argue the process these days has more to do with the harness than the model, at least when we're talking about the SOTA options. Claude Code's biggest advantage isn't Opus, rather it's the shared knowledge the community has been building and sharing around using it effectively. Almost all of the out-of-the-box tutorials and skills and frameworks are built for Claude first, then Codex maybe.

I'd go further and say that CC and Codex are not even the best harnesses available, they just offer the most subsidized rate plans.

ethbr1 · 2026-05-30T17:32:45 1780162365

> Claude Code's biggest advantage isn't Opus, rather it's the shared knowledge the community has been building and sharing around using it effectively.

This. Never underestimate the ability of a large number of power users to substantially improve the actual utility of a complex software product.

They always have more time (and sometimes more skill) than a product's developers.

Sometimes the quantity of monkeys matters more than the quality of the typewriters.

amazingamazing · 2026-05-30T16:16:51 1780157811

In my test the prompt was the same and all suggestions were auto-accepted, so indeed there was no difference other than model and harness. The number of characters typed and the interactions with the harnesses were exactly the same.

kbenson · 2026-05-30T17:53:39 1780163619

To keep with the analogy, isn't that sort of like testing two cars by having them both drive the same few-hundred-foot stretch of new road at the posted speed limit of 35 MPH? You will test some things doing that, but not particularly well, and hardly all the things people find interesting and useful for comparing the performance of cars.

To bring this back to the discussion at hand (and to be redundant, as it's been mentioned here already), there are many aspects of using an LLM that are not purely about the output from a single or a few well-formed prompts. Additionally, if the end results are very similar, these other aspects will have an outsized influence on people's perspective of the tools, as they're the only differences worth choosing one model over another.