codex gpt-5.5 is far superior to opus 4.7 when working on large projects

bob1029 · 2026-05-30T15:08:53 1780153733

I strongly believe the reason gpt-5.x performs so well on large projects is because of the focused training they've done on their dedicated apply_patch primitive.

The official implementation of apply_patch is well thought out. It is a two-phase process that will not make any changes until every file in the change set can be patched unambiguously. The pre-commit error feedback usually lets the model fix anchoring issues within one or two additional attempts. It generally goes something like:

  Reading file A L1:154
  Reading file B L1:123
  Attempting to apply patch... 
  [anchor errors for both A & B]
  Reading file A L43:67
  Reading file B L50:74
  Attempting to apply patch... 
  Patch succeeded! Running compilation & unit tests...

The anchor error feedback helps massively because in this implementation it also returns the current line numbers where the problem was found.

Techniques that replace the whole file or depend on find-replace are useful in more isolated contexts. However, when you need to refactor 20+ files, something like apply_patch is what you want. Anything that depends on specific line numbers for actual replacement targets is a total dead end for complex edit scenarios.
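
To make the two-phase idea concrete, here is a minimal sketch in Python. It is not the official apply_patch implementation; the `Hunk` type, the `apply_patch` signature, the in-memory `files` dict, and the error wording are all assumptions made for illustration. Phase 1 checks that every hunk's context lines match exactly once and collects errors with current line numbers; phase 2 runs only if phase 1 found no errors.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Hunk:
    """Hypothetical hunk: replace the `old` lines with the `new` lines in `path`."""
    path: str
    old: list[str]  # anchor: context lines that must match exactly once
    new: list[str]  # replacement lines


def find_anchor(lines: list[str], old: list[str]) -> list[int]:
    """Return every 1-based line number where `old` matches contiguously."""
    n = len(old)
    return [i + 1 for i in range(len(lines) - n + 1) if lines[i:i + n] == old]


def apply_patch(files: dict[str, list[str]], hunks: list[Hunk]) -> tuple[bool, list[str]]:
    """Validate all hunks first; modify `files` only if every hunk is unambiguous."""
    errors: list[str] = []
    plan: list[tuple[Hunk, int]] = []

    # Phase 1: locate every anchor without touching any file.
    for h in hunks:
        if h.path not in files:
            errors.append(f"{h.path}: file not found")
            continue
        hits = find_anchor(files[h.path], h.old)
        if len(hits) == 1:
            plan.append((h, hits[0]))
        elif not hits:
            # Hint for the retry: where does the first anchor line occur today?
            near = [i + 1 for i, line in enumerate(files[h.path]) if line == h.old[0]]
            errors.append(f"{h.path}: anchor not found; first anchor line seen at lines {near}")
        else:
            errors.append(f"{h.path}: anchor ambiguous; matches at lines {hits}")

    if errors:
        return False, errors  # nothing has been modified

    # Phase 2: apply bottom-up so earlier line numbers stay valid.
    # (Overlapping hunks in one file are not handled in this sketch.)
    for h, start in sorted(plan, key=lambda item: item[1], reverse=True):
        lines = files[h.path]
        lines[start - 1:start - 1 + len(h.old)] = h.new
    return True, []
```

If one hunk in a two-file change set is ambiguous, neither file is touched, and the caller gets back the candidate line numbers so it can re-read that region and retry with a longer anchor.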

https://developers.openai.com/api/docs/guides/tools-apply-pa...

meowface · 2026-05-30T14:06:53 1780150013

GPT-5.5 is the better programmer but Opus 4.8 remains the better system architect and product designer.

Codex is very "miss the forest for the trees", but is much better at successfully making large changes in large codebases. Claude Code makes more mistakes, but has more taste and a better grasp on idiomatic and elegant software development.

If you can afford to, I recommend juggling both.

theturtletalks · 2026-05-30T14:16:26 1780150586

Great analysis, and it matches my experience as well. Codex is better when you know how you want the design and the architecture and you drive the agent a lot more aggressively. Claude Code feels more like autopilot, so executives and users who didn’t code before AI like it a lot more.

But I feel like an expert who can drive GPT aggressively will outperform Opus. It’s why some smart people I know are opting for GPT and have fallen off on Opus. It’s like asking an F1 driver to sit in a taxi.

sobellian · 2026-05-30T15:13:43 1780154023

Opus 4.7 (haven't tried 4.8) just really struggles writing correct code for complicated (i.e. valuable) work. I can handle architecture, which takes <1% of my time anyway. But writing code that's wrong is a cardinal sin. I've had much more luck with GPT 5.5 so far.

CuriouslyC · 2026-05-30T14:47:11 1780152431

This is exactly right. Claude has baked in autonomy and preferences that let it handle underspecified prompts elegantly, which makes it seem smarter to people who like to prompt that way, but it also ignores instructions and fights you on things, which makes it a bad model for people who know what they want to do and specify it.

bayindirh · 2026-05-30T14:12:33 1780150353

I find it interesting to argue that a complex weighted graph has taste.

This is not a jab, but a genuine curiosity of mine.

chronofar · 2026-05-30T14:56:46 1780153006

More interesting than arguing that a jumble of electrochemical reactions has taste? That may seem more readily familiar, but it is no less strange if you prod at it. Nonetheless, it’s difficult to argue that either doesn't produce output that has qualities of discernment (i.e. taste).

jmcodes · 2026-05-30T16:25:28 1780158328

Isn't it just arguing that one complex weighted graph was tuned to output tokens that more align with what current day users would define as 'taste'?

I don't think it necessarily says anything about a model itself having 'taste' in some subjective way.

If the fashion changes would the model update with it without retraining? No. So the model doesn't have 'taste' in that sense. It has alignment to current human definitions of taste.

knollimar · 2026-05-30T14:21:24 1780150884

The roulette pockets for the model are bigger for some outputs than others. Draw a big enough black box around it and a different one around humans and it's indistinguishable.

meowface · 2026-05-31T03:25:03 1780197903

It is more capable of writing code I find tasteful and maintainable. It is debatable to what extent it itself has taste. Its outputs just suit my taste more than Codex's do, even though Codex introduces fewer bugs.

alstonite · 2026-05-30T14:17:55 1780150675

I think the long-winded way to say it is that the taste one complex weighted graph was trained on was better than the other's.

lucamark · 2026-05-30T14:31:27 1780151487

I'm experiencing the same. Codex gpt-5.5 has more brilliant intuitions and writes less code, i.e. it identifies the exact point where the modification should be made. That said, there are huge improvements in personality from opus 4.7 (it was too accommodating) to opus 4.8.

vb-8448 · 2026-05-30T15:11:09 1780153869

My problem with codex/gpt is that it is too verbose (mostly js and python): a lot of helper functions, a lot of 1 or 2 line functions used in 1 place only, a lot of types or proxy-like objects.

I have specific skills for trying to avoid this, but nevertheless I spend half of the time fighting with its verbosity.

Currently, I'm trying to scaffold the functions/classes I know I need with NotImplemented and ask it to implement only inside those specific places. It's a little bit better, but I still have to fight with functions defined inside functions ...
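
For anyone wanting to try the scaffolding approach, a minimal sketch in Python might look like this. It assumes "NotImplemented" refers to the usual idiom of raising `NotImplementedError` from a stub; the `ReportBuilder` class and its method names are hypothetical and only there to show the shape.

```python
from __future__ import annotations


class ReportBuilder:
    """Hypothetical scaffold: signatures are fixed by hand, bodies are left as stubs."""

    def load_rows(self, path: str) -> list[dict]:
        # Stub: the agent is asked to fill in this body and nothing else.
        raise NotImplementedError("load_rows")

    def summarize(self, rows: list[dict]) -> dict:
        # Stub: no new helpers or nested functions should be needed here.
        raise NotImplementedError("summarize")
```

The prompt then restricts the agent to replacing the `raise NotImplementedError(...)` lines, which keeps the set of functions and classes under your control.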

RA_Fisher · 2026-05-30T14:08:29 1780150109

In what ways? LM Arena has Opus 4.7 w/ 1567 -/+ 7 vs. 1505 -/+ 10 from GPT-5.5 Codex in code. I'm currently using both.

Admittedly my recent experience tilts toward Opus, now 4.8, but you and others have my interest piqued re: GPT-5.5 Codex so I'm trying that more now.

spongebobstoes · 2026-05-30T15:43:58 1780155838

arena is not a good benchmark, it is very susceptible to sycophancy

the__alchemist · 2026-05-30T14:16:26 1780150586

You're using last week's model; Opus 4.7 is old news. Opus 6.9 is the new hotness; it is a better product manager than GPT, and has more X productivity. It replaced our junior dev team, and tells me my hair looks good.

malfist · 2026-05-30T15:52:52 1780156372

Your research finding LLMs ineffective is invalid because you used 6.9. The current SOTA is 6.91 and it's leaps and bounds better than yesterday's 6.9

the__alchemist · 2026-05-30T16:16:08 1780157768

Fuck; you are right.

dangus · 2026-05-30T14:12:07 1780150327

Opus 4.7 is not the current version of Opus.

BoredPositron · 2026-05-30T14:03:14 1780149794

Not everyone is a developer...

_puk · 2026-05-30T14:04:08 1780149848

And 4.7 is so last week..

keyle · 2026-05-30T14:22:13 1780150933

Soon none of us will be! right?

sergiotapia · 2026-05-30T16:06:19 1780157179

My experience as well. Although this week I've moved to Cursor and Composer 2.5. It's so fast that any faults can be iterated on super quickly. The model is just insanely good with code things.

Keyframe · 2026-05-30T16:41:26 1780159286

source?

oofbey · 2026-05-30T14:13:23 1780150403

GPT 5.5 still invents facts rather than looking them up, and manages to come across both as condescending and sycophantic. It feels like talking to a used car salesman.

folkrav · 2026-05-30T14:35:39 1780151739

Funny cause I'm quite literally having this exact issue with 4.8 as we speak. I've been going back and forth with Claude since yesterday afternoon on chopping up, stabilizing and facilitating recovery on a flaky mega-pipeline. Not 5 minutes ago, I had to remind it that two of the solutions it proposed were not possible because the target technology doesn't allow what it wanted to do, despite my having pointed it to the very docs that say it can't be done in the first place.

As far as its tone... Both feel sycophantic as hell to me. To be honest, they all just feel that way.

theshackleford · 2026-05-30T14:42:02 1780152122

> GPT 5.5 still invents facts rather than looking them up

So does Claude, what’s your point?

I used it and ChatGPT this week in trying to assist troubleshooting a complex DB related issue and Claude had to apologise no less than three times in which it admitted to talking complete shit.

Just one example of the kind of shit it dribbled:

> I need to be upfront with you. I should not have claimed X as if I knew that for a fact. That was overreach on my part.

gitaarik · 2026-05-31T03:31:43 1780198303

I've noticed that Claude makes fewer mistakes than in the past. I feel it checks its own work more rigorously now, and understands its own claims better and knows how to confirm them.

It rarely happens to me that Claude comes up with clearly wrong modifications, and then only with quite complicated problems, for example ones with unclear variable names. But usually Claude asks me when something is unclear.

No experience with Codex though