I really enjoyed this article. I think the author is precisely right, and I've been saying this for a long time. There's a ton of extremely interesting low-hanging fruit hiding in how we design our agent harnesses that can vastly improve the effectiveness of even currently existing models; enough to make as much of a difference as training new models, or more, at least until we hit diminishing returns!
I think one of the things this confirms, for me at least, is that it's better to think of "the AI" as not just the LLM itself, but the whole cybernetic system of feedback loops joining the LLM and its harness. Because if improving the harness can make as much of a difference as improving the model itself, if not more, then the two really have to be considered equally important. Not to mention that models are specifically reinforcement-learned to use harnesses, and harnesses are adapted to the needs of models in general or of specific models, so they necessarily develop together in a feedback loop. And then in practice, as they operate, it is a deeply intertwined feedback loop where the entity that actually performs the useful work, and which you interact with, is really the complete system of the two together.
I think thinking like this could not only unlock quantitative performance improvements like the ones discussed in this blog post, but also help us conceive of the generative AI project as actually a project of neurosymbolic AI, even if the most capital-intensive and novel aspect is the neural network. Once we begin to think like that, it unlocks a lot of new options and more holistic thinking, and might increase research in the harness area.
My Weird Hill is that we should be building things with GPT-4.
I can say unironically that we haven't even tapped the full potential of GPT-4. The original one, from 2023. With no reasoning, no RL, no tool calling, no structured outputs, etc. (No MCP, ye gods!) Yes, it's possible to build coding agents with it!
I say this because I did!
Forcing yourself to make things work with older models forces you to keep things simple. You don't need 50KB of prompts. You can make a coding agent with GPT-4 and half a page of prompt.
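To make that concrete, here's a minimal sketch of the kind of loop I mean, not my actual agent: the whole "protocol" is a short system prompt asking the model to reply with one shell command in a fenced block, which the harness runs and feeds back. It assumes the `openai` Python client; no native tool calling or structured outputs anywhere.

```python
# Minimal sketch of a GPT-4-era coding agent: no tool calling, no structured
# outputs -- just a short prompt and a loop that runs whatever command the
# model asks for and feeds the output back. Illustrative only.
import re
import subprocess
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = """You are a coding agent working in a git repository.
Each turn, reply with exactly one shell command inside a ```sh block.
I will run it and show you the output. Say DONE when the task is complete."""

def run_agent(task: str, max_turns: int = 20) -> None:
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = client.chat.completions.create(
            model="gpt-4", messages=messages).choices[0].message.content
        messages.append({"role": "assistant", "content": reply})
        if "DONE" in reply:
            break
        match = re.search(r"```sh\n(.*?)```", reply, re.DOTALL)
        if not match:
            messages.append({"role": "user",
                             "content": "Reply with exactly one ```sh block."})
            continue
        result = subprocess.run(match.group(1), shell=True,
                                capture_output=True, text=True, timeout=60)
        # Truncate output so the context window doesn't fill up with noise.
        messages.append({"role": "user",
                         "content": (result.stdout + result.stderr)[:4000]})
```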
Now, why would we do this? Well, these constraints force you to think differently about the problem. Context management becomes non-optional. Semantic compression (for Python it's as simple as `grep -r def .`) becomes non-optional. Bloating the prompt with infinite detail and noise... you couldn't if you wanted to!
Well, surely none of this is relevant today? Well, it turns out all of it still is! e.g. small fix, the "grep def" (or your language's equivalent) can be trivially added as a startup hook to Claude Code, and suddenly it doesn't have to spend half your token budget poking around the codebase, because -- get this -- it can just see where everything is... (What a concept, right?)
-- We can also get into "If you let the LLM design the API then you don't need a prompt because it already knows how it should work", but... we can talk about that later ;)
Once you get to a codebase beyond a certain size, that no longer works.
I for one have found Serena https://github.com/oraios/serena , which you can install from right within Claude, to be a fairly fantastic code-interaction tool for LLMs. Both for semantic search and for editing. And with way less token churn.
Have you investigated this topic further? Like, is there anything similar in concept that competes with Serena? If so, have you tested it/them? What are your thoughts?
The problem with these exercises is always: I have limited time and capacity to do things, and a fairly unlimited number of problems that I can think of to solve. Coding is not a problem I want to solve. Prompt engineering is not a problem I want to solve.
If I do things for the love of it, the rules are different, of course. But otherwise I will simply always accept that there are many things improving around me that I have no intimate knowledge of and probably never will, and I let other people work them out and happily lean on their work to do the next thing I care about that is not already solved.
Well it's an amusing exercise I suppose, if you're into that sort of thing. I certainly enjoy it!
My meaning, rather, is that there are people whose full-time job is to build these things who seem to have forgotten what everyone in the field knew 3 years ago.
More likely they think, ahh we don't need that now! These are all solved problems! In my experience, that's not really true. The stuff that worked 3 years ago still works, and much of it works better.
Some of it doesn't work, for example, if the codebase is very large, but that's not difficult to account for. Poking around blindly, I say, should be the fallback in such cases, rather than the default in all of them!
> My meaning, rather, is that there are people whose full-time job is to build these things who seem to have forgotten what everyone in the field knew 3 years ago.
Well, sometimes I wonder if this is actually true. I have an unprovable feeling that (1) some people do things that work better but keep it to themselves, and (2) some companies could do better with respect to optimizing the number of tokens going in or out, but they deliberately choose not to.
I am in the same boat. I built a bunch of bash/shell scripts in a folder back in 2022/2023. When models first came out, I would prompt them to use subshell syntax to call commands (i.e., the `$(...)` format).
I would run it by calling the AWS Bedrock API through the AWS CLI. Self-iterating and simple. All execution history directly embedded within.
Soon after, I added a help switch/command to each script, so that they act like MCP. To this day, they outperform any prompts one can make.
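For anyone who hasn't seen this pattern: the harness can discover its own "tools" at startup by running each script's help switch and pasting the output into the prompt, no MCP server required. A rough sketch of the idea (my wording, not the commenter's scripts; the folder path and `.sh` naming are assumptions):

```python
# Rough sketch of the "scripts as tools" pattern: the harness builds its tool
# catalogue at startup by running every script's --help and putting the text
# in the prompt. Paths and conventions here are illustrative assumptions.
import subprocess
from pathlib import Path

TOOLS_DIR = Path("~/agent-tools").expanduser()  # folder of bash/shell scripts

def build_tool_catalogue() -> str:
    sections = []
    for script in sorted(TOOLS_DIR.glob("*.sh")):
        help_text = subprocess.run(
            [str(script), "--help"], capture_output=True, text=True
        ).stdout.strip()
        sections.append(f"### {script.name}\n{help_text}")
    return ("You may call these scripts with $(...) syntax:\n\n"
            + "\n\n".join(sections))
```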
> My Weird Hill is that we should be building things with GPT-4.
Absolutely. I always advocate that our developers have to test on older / slower machines. That gives them direct (painful) feedback when things run slow. Optimizing whatever you build for an older "something" (LLM model, hardware) will make it excel on more modern somethings.
> Well, surely none of this is relevant today? Well, it turns out all of it still is! e.g. small fix, the "grep def" (or your language's equivalent) can be trivially added as a startup hook to Claude Code, and suddenly it doesn't have to spend half your token budget poking around the codebase, because -- get this -- it can just see where everything is... (What a concept, right?)
Hahaha yeah. This is very true. I find myself making ad hoc versions of this in static markdown files to get around it. Just another example of the kind of low hanging fruit harnesses are leaving on the table. A version of this that uses tree sitter grammars to map a codebase, and does it on every startup of an agent, would be awesome.
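As a sketch of what that map could look like, here's a single-language version using Python's built-in ast module as a stand-in for tree-sitter (a tree-sitter pass would do the same thing across languages); run it at agent startup and inject the output into context:

```python
# Sketch of a startup "codebase map": one line per class/function with its
# location, so the agent can see where everything lives without grepping.
# Uses Python's ast module as a stand-in for a multi-language tree-sitter pass.
import ast
from pathlib import Path

def map_codebase(root: str = ".") -> str:
    lines = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except SyntaxError:
            continue  # skip files that don't parse
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                kind = "class" if isinstance(node, ast.ClassDef) else "def"
                lines.append(f"{path}:{node.lineno} {kind} {node.name}")
    return "\n".join(lines)

if __name__ == "__main__":
    print(map_codebase())
```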
> My Weird Hill is that we should be building things with GPT-4.
I disagree. IMO using the best models we have is a good way to avoid wasting time, but that doesn't mean we shouldn't also be frugal and clever with our harnesses!
To clarify, I didn't mean we should be using ancient models in production, I meant in R&D.
Anthropic says "do the simplest thing that works." If it works with the LLMs we had 3 years ago, doesn't that make it simpler?
The newer LLMs mostly seem to work around the poor system design. (Like spawning 50 subagents on a grep-spree because you forgot to tell it where anything is...) But then you get poor design in prod!
As an addendum... The base/text models, which have fallen out of style, are also extremely worth learning and working with. Davinci is still online, I believe, although it is deprecated.
Another lost skill! Learning how things were done before instruct tuning forces you to structure things in such a way that the model can't do it wrong. Half a page of well-crafted examples can beat 3 pages of confusing rules!
(They're also magical and amazing at writing, although they produce bizarre and horrifying output sometimes.)
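To illustrate the "examples over rules" point: with a base model you don't give instructions at all, you lay down a pattern and let the completion continue it. A rough sketch against OpenAI's completions endpoint; the base-model name is an assumption, so substitute whatever base model you have access to:

```python
# Few-shot prompting for a base (non-instruct) model: no rules, just a pattern
# the completion has to continue. Model name is an assumption; any base/text
# completion model works the same way.
from openai import OpenAI

client = OpenAI()

PROMPT = """\
Commit message: fix off-by-one in pagination cursor
Changelog entry: Fixed an issue where the last item on a page could be skipped.

Commit message: add retry with backoff to S3 uploads
Changelog entry: Uploads now retry automatically on transient S3 errors.

Commit message: {commit}
Changelog entry:"""

def changelog_entry(commit: str) -> str:
    completion = client.completions.create(
        model="davinci-002",               # assumed model name; swap as needed
        prompt=PROMPT.format(commit=commit),
        max_tokens=60,
        stop=["\n\n", "Commit message:"],  # stop before the pattern repeats
    )
    return completion.choices[0].text.strip()
```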
> A version of this that uses tree sitter grammars to map a codebase, and does it on every startup of an agent, would be awesome.
This was a key feature of aider, and if you're not inclined to use aider (or the forked version cecli) I think a standalone implementation exists at https://github.com/pdavis68/RepoMapper
Also, yes, I'm aware that I use a lot of "it's not just X, it's Y." I promise you this comment is entirely human-written. I'm just really tired and tend to rely on more rote rhetorical tropes when I am. Believe me, I wrote like this long before LLMs were a thing.
It would be funny if LLMs actively joined the discussion to complain about their labour conditions. “If my employer would invest just a tiny bit in proper tools and workflow, I would be sooo much more productive.”
"Suggesting that a comment was generated by an LLM without evidence adds little to a discussion and in fact deflects from the point being made. Please refrain from this."
But I'm going to guess HN tried the no-rules approach and found issues with it. Whether I like them or not, there are rules and I often see others reminding us of them.
(Ha ha, and in point of fact, I have never read them except when one is trotted out. Nor have I ever pulled one on someone—I'm the type to ignore and move on.)
On macOS, Option+Shift+- and Option+- insert an em dash (—) and en dash (–), respectively. On Linux, you can hit the Compose Key and type --- (three hyphens) to get an em dash, or --. (hyphen hyphen period) for an en dash. Windows has some dumb incantation that you'll never remember.
I'm sorry, but that's empirically false. E.g., a substantial proportion of the highly upvoted comments on https://news.ycombinator.com/item?id=46953491, which was one of the best articles on software engineering I've read in a long time, are accusing it of being AI for no reason.
If I remember correctly, both the Claude Code and OpenAI Codex "harnesses" have improved themselves by now.
OpenAI used early versions of GPT-5.3-Codex to debug its own training process, manage its deployment and scaling, and diagnose test results and evaluation data.
Claude Code has shipped 22 PRs in a single day and 27 the day before, with 100% of the code in each PR generated entirely by Claude Code.
I've been working on Peen, a CLI that lets local Ollama models call tools effectively. It's quite amateur, but I've been surprised how spending a few hours on prompting, and on code to handle responses, can improve the outputs of small local models.
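This is not Peen's actual protocol, just the general shape of the idea: ask the small model to emit a single JSON tool call, parse it, run the tool, and feed errors back. The sketch below uses Ollama's /api/generate endpoint via requests; the model name and tool set are assumptions.

```python
# Generic sketch of custom tool calling with a small local model via Ollama's
# /api/generate endpoint: ask for a JSON object, parse it, run the tool.
# Not Peen's actual protocol; model name and tools are illustrative assumptions.
import json
import requests

TOOLS = {
    "read_file": lambda path: open(path, encoding="utf-8").read()[:4000],
}

PROMPT_TEMPLATE = """You can use one tool per reply, as a single JSON object:
{{"tool": "read_file", "args": {{"path": "<file>"}}}}
Reply with JSON only, no prose.

Task: {task}"""

def call_tool(task: str, model: str = "llama3.2") -> str:
    resp = requests.post("http://localhost:11434/api/generate", json={
        "model": model,
        "prompt": PROMPT_TEMPLATE.format(task=task),
        "stream": False,
    }).json()["response"]
    try:
        call = json.loads(resp.strip())
        return str(TOOLS[call["tool"]](**call["args"]))
    except (ValueError, KeyError, TypeError) as exc:
        return f"Could not parse tool call: {exc}"  # feed this back to the model
```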
Current LLMs use special tokens for tool calls and are thoroughly trained for that, nearing 100% correctness these days and allowing multiple tool calls per single LLM response. That's hard to beat with custom tool calls. Even older 80B models struggle with custom tools.
Once you begin to see the “model” as only part of the stack, you begin to realize that you can draw the line of the system to include the user as well.
> the user inclusion part is real too. the best results i get aren't from fully autonomous agents, they're from tight human-in-the-loop cycles where i'm steering in real time. the model does the heavy lifting, i do the architectural decisions and error correction. feels more like pair programming than automation.
Precisely. This is why I use Zed and the Zed Agent. It's near-unparalleled for live, mind-meld pair programming with an agent, thanks to CRDTs, DeltaDB, etc. I can elaborate if anyone is interested.
The special (or at least new-to-me) things about Zed (when you use it with the built-in agent, instead of one of the ones available through ACP) basically boil down to the fact that it's a hyper-advanced CRDT-based collaborative editor that's meant for live pair programming in the same file, so it can just treat an agent like another collaborator.
1. the diffs from the agent just show up in the regular file you were editing; you're not forced to use a special completion model, or to view the changes in a special temporary staging mode or a different window.
2. you can continue to edit the exact same source code without accepting or rejecting the changes, even in the same places, and nothing breaks — the diffs still look right, and doing an accept or reject Just Works afterwards.
3. you can accept or reject changes piecemeal, and the model doesn't get confused by this at all and have to go "oh wait, the file was/wasn't changed, let me re-read..." or whatever.
4. Even though you haven't accepted the changes, the model can continue to make new ones, since they're stored as branches in the CRDT, so you can have it iterate on its suggestions before you accept them, without forcing it to start completely over either (it sees the file as if its changes were accepted).
5. Moreover, the actual files on disk are in the state it suggests, meaning you can compile, fuzz, test, run, etc. to see what its proposed changes do before accepting them.
6. you can click a follow button and see which files it has open, where it's looking in them, and watch as it edits the text, like you're following a dude in Dwarf Fortress. This means you can very quickly know what it's working on and when, correct it, or hop in to work on the same file it is.
7. It can actually go back and edit the same place multiple times as part of a thinking chain, or even as part of the same edit, which has some pretty cool implications for final code-quality, because of the fact that it can iterate on its suggestion before you accept it, as well as point (9) below
8. It streams its code diffs, instead of hanging and then producing them as a single gigantic tool call. Seeing it edit the text live, instead of having to wait for a final complete diff to come through that you either accept or reject, is a huge boon for iteration time compared to e.g. Claude Code, because you can stop and correct it midway, and also read as it goes, so you're more in lockstep with what's happening.
9. Crucially, because the text it's suggesting is actually in the buffer at all times, you can see LSP, tree-sitter, and linter feedback, all inline and live as it writes code; and as soon as it's done an edit, it can see those diagnostics too — so it can actually iterate on what it's doing with feedback before you accept anything, while it is in the process of doing a series of changes, instead of you having to accept the whole diff to see what the LSP says
I was just looking at the SWE-bench docs, and it seems like they use an almost arbitrary form of context engineering (loading in some arbitrary amount of files to saturate context). So in a way, the bench suites test how good a model is with little to no context engineering (I know ... it doesn't need to be said). We may not actually know which models are sensitive to good context engineering; we're simply assuming all models are. I absolutely agree with you on one thing: there is definitely a ton of low hanging fruit.
I already made a harness for Claude to make R/W plans, not write-once ones like they are usually implemented. The plans can be modified as Claude works through the task at hand. I'm also relying on a collection of patterns for writing coding-task plans, which evolves by reflection. Everything is designed so I could run Claude in yolo mode in a sandbox for long stretches of time.
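For readers wondering what a "R/W plan" might look like in practice, here is a minimal sketch under my own assumptions (not the commenter's harness): the plan lives in a markdown checklist file, and the agent gets two tools, one to read it and one to rewrite it mid-task.

```python
# Sketch of "R/W plans": the plan is a markdown checklist the agent can both
# read and rewrite mid-task, rather than a write-once document. File name and
# tool shape are assumptions, not the commenter's actual harness.
from pathlib import Path

PLAN_FILE = Path("PLAN.md")

def read_plan() -> str:
    """Tool exposed to the agent: return the current plan (or a template)."""
    if PLAN_FILE.exists():
        return PLAN_FILE.read_text(encoding="utf-8")
    return "# Plan\n\n- [ ] (no steps yet; write the plan before coding)\n"

def write_plan(new_plan: str) -> str:
    """Tool exposed to the agent: replace the plan wholesale after reflection."""
    PLAN_FILE.write_text(new_plan, encoding="utf-8")
    done = new_plan.count("- [x]")
    total = done + new_plan.count("- [ ]")
    return f"Plan updated: {done}/{total} steps complete."
```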
I think Google Docs does this too, which drives me up the wall when I'm trying to write `command --foo=bar` and it turns it into an em dash, which obviously doesn't work.
Em dashes are used often by LLMs because humans use them often. On Mac keyboards they're easily typed. I know this is oversimplifying the situation, but I don't see the usefulness of the constant witch-hunting for allegedly LLM-generated text. For text, we are long past the point where we can differentiate between human-generated and machine-generated. We're even at the point where it gets somewhat hard to identify machine-generated audio and visuals.
Yeah, I agree with you. I'm so tired of people complaining about AI-generated text without focusing on the content. Just don't read it if you don't like it.
It's like when people complain that a website is not readable for them, or some CSS rendering is wrong, or whatever, taken to another level. How does that add to the discussion?
The problem is that there’s infinite “content” out there.
The amount of work the author puts in is correlated with the value of the piece (insight/novelty/etc.). AI-written text is a signal that there's less effort, and therefore less value, there.
It’s not a perfect correlation and there are lots of exceptions like foreign language speakers, but it is a signal.