I really enjoyed this article. I think the author is precisely right, and I've been saying this for a long time. There's a ton of extremely interesting low-hanging fruit hiding in how we design our agent harnesses that can vastly improve the effectiveness of even currently existing models; enough to make as much of a difference as training new models, or more, at least until we hit diminishing returns!
I think one of the things this confirms, for me at least, is that it's better to think of "the AI" as not just the LLM itself, but the whole cybernetic system of feedback loops joining the LLM and its harness. Because if improving the harness can make as much of a difference as improving the model itself, if not more, then the two really have to be considered equally important. Not to mention that models are specifically reinforcement-learned to use harnesses, and harnesses are adapted to the needs of models in general or of specific models, so they necessarily develop together in a feedback loop. And then in practice, as they operate, it is a deeply intertwined feedback loop where the entity that actually performs the useful work, and which you interact with, is really the complete system of the two together.
I think thinking like this could not only unlock quantitative performance improvements like the ones discussed in this blog post, but also help us conceive of the generative AI project as actually a project of neurosymbolic AI, even if the most capital-intensive and novel aspect is the neural network. Once we begin to think like that, it unlocks a lot of new options and more holistic thinking, and might increase research in the harness area.
My Weird Hill is that we should be building things with GPT-4.
I can say unironically that we haven't even tapped the full potential of GPT-4. The original one, from 2023. With no reasoning, no RL, no tool calling, no structured outputs, etc. (No MCP, ye gods!) Yes, it's possible to build coding agents with it!
I say this because I did!
Forcing yourself to make things work with older models forces you to keep things simple. You don't need 50KB of prompts. You can make a coding agent with GPT-4 and half a page of prompt.
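To make that concrete, here's a minimal sketch of the kind of loop I mean, not my actual agent: the whole "protocol" is a short system prompt asking the model to reply with one shell command in a fenced block, which the harness runs and feeds back. It assumes the `openai` Python client; no native tool calling or structured outputs anywhere.

```python
# Minimal sketch of a GPT-4-era coding agent: no tool calling, no structured
# outputs -- just a short prompt and a loop that runs whatever command the
# model asks for and feeds the output back. Illustrative only.
import re
import subprocess
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = """You are a coding agent working in a git repository.
Each turn, reply with exactly one shell command inside a ```sh block.
I will run it and show you the output. Say DONE when the task is complete."""

def run_agent(task: str, max_turns: int = 20) -> None:
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = client.chat.completions.create(
            model="gpt-4", messages=messages).choices[0].message.content
        messages.append({"role": "assistant", "content": reply})
        if "DONE" in reply:
            break
        match = re.search(r"```sh\n(.*?)```", reply, re.DOTALL)
        if not match:
            messages.append({"role": "user",
                             "content": "Reply with exactly one ```sh block."})
            continue
        result = subprocess.run(match.group(1), shell=True,
                                capture_output=True, text=True, timeout=60)
        # Truncate output so the context window doesn't fill up with noise.
        messages.append({"role": "user",
                         "content": (result.stdout + result.stderr)[:4000]})
```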
Now, why would we do this? Well, these constraints force you to think differently about the problem. Context management becomes non-optional. Semantic compression (for Python it's as simple as `grep -r def .`) becomes non-optional. Bloating the prompt with infinite detail and noise... you couldn't if you wanted to!
Well, surely none of this is relevant today? Well, it turns out all of it still is! e.g. small fix, the "grep def" (or your language's equivalent) can be trivially added as a startup hook to Claude Code, and suddenly it doesn't have to spend half your token budget poking around the codebase, because -- get this -- it can just see where everything is... (What a concept, right?)
-- We can also get into "If you let the LLM design the API then you don't need a prompt because it already knows how it should work", but... we can talk about that later ;)
Once you get to a codebase beyond a certain size, that no longer works.
I for one have found Serena https://github.com/oraios/serena , which you can install from right within Claude, to be a fairly fantastic code-interaction tool for LLMs. Both for semantic search and for editing. And with way less token churn.
Have you investigated this topic further? Like, is there anything similar in concept that competes with Serena? If so, have you tested it/them? What are your thoughts?
The problem with these exercises is always: I have limited time and capacity to do things, and a fairly unlimited number of problems that I can think of to solve. Coding is not a problem I want to solve. Prompt engineering is not a problem I want to solve.
If I do things for the love of it, the rules are different, of course. But otherwise I will simply always accept that there are many things improving around me that I have no intimate knowledge of and probably never will, and I let other people work them out and happily lean on their work to do the next thing I care about that is not already solved.
Well it's an amusing exercise I suppose, if you're into that sort of thing. I certainly enjoy it!
My meaning, rather, is that there are people whose full-time job is to build these things who seem to have forgotten what everyone in the field knew 3 years ago.
More likely they think, ahh we don't need that now! These are all solved problems! In my experience, that's not really true. The stuff that worked 3 years ago still works, and much of it works better.
Some of it doesn't work, for example, if the codebase is very large, but that's not difficult to account for. Poking around blindly, I say, should be the fallback in such cases, rather than the default in all of them!
> My meaning, rather, is that there are people whose full-time job is to build these things who seem to have forgotten what everyone in the field knew 3 years ago.
Well, sometimes I wonder if this is actually true. I have an unprovable feeling that (1) some people do things that work better but keep it to themselves, and (2) some companies could do better with respect to optimizing the number of tokens going in or out, but they deliberately choose not to.
I am in the same boat. I built a bunch of bash/shell scripts in a folder back in 2022/2023. When models first came out, I would prompt them to use subshell syntax to call commands (i.e., the `$(...)` format).
I would run it by calling the AWS Bedrock API through the AWS CLI. Self-iterating and simple. All execution history directly embedded within.
Soon after, I added a help switch/command to each script, so that they act like MCP. To this day, they outperform any prompts one can make.
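For anyone who hasn't seen this pattern: the harness can discover its own "tools" at startup by running each script's help switch and pasting the output into the prompt, no MCP server required. A rough sketch of the idea (my wording, not the commenter's scripts; the folder path and `.sh` naming are assumptions):

```python
# Rough sketch of the "scripts as tools" pattern: the harness builds its tool
# catalogue at startup by running every script's --help and putting the text
# in the prompt. Paths and conventions here are illustrative assumptions.
import subprocess
from pathlib import Path

TOOLS_DIR = Path("~/agent-tools").expanduser()  # folder of bash/shell scripts

def build_tool_catalogue() -> str:
    sections = []
    for script in sorted(TOOLS_DIR.glob("*.sh")):
        help_text = subprocess.run(
            [str(script), "--help"], capture_output=True, text=True
        ).stdout.strip()
        sections.append(f"### {script.name}\n{help_text}")
    return ("You may call these scripts with $(...) syntax:\n\n"
            + "\n\n".join(sections))
```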
> My Weird Hill is that we should be building things with GPT-4.
Absolutely. I always advocate that our developers have to test on older / slower machines. That gives them direct (painful) feedback when things run slow. Optimizing whatever you build for an older "something" (LLM model, hardware) will make it excel on more modern somethings.
> Well, surely none of this is relevant today? Well, it turns out all of it still is! e.g. small fix, the "grep def" (or your language's equivalent) can be trivially added as a startup hook to Claude Code, and suddenly it doesn't have to spend half your token budget poking around the codebase, because -- get this -- it can just see where everything is... (What a concept, right?)
Hahaha yeah. This is very true. I find myself making ad hoc versions of this in static markdown files to get around it. Just another example of the kind of low hanging fruit harnesses are leaving on the table. A version of this that uses tree sitter grammars to map a codebase, and does it on every startup of an agent, would be awesome.
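As a sketch of what that map could look like, here's a single-language version using Python's built-in ast module as a stand-in for tree-sitter (a tree-sitter pass would do the same thing across languages); run it at agent startup and inject the output into context:

```python
# Sketch of a startup "codebase map": one line per class/function with its
# location, so the agent can see where everything lives without grepping.
# Uses Python's ast module as a stand-in for a multi-language tree-sitter pass.
import ast
from pathlib import Path

def map_codebase(root: str = ".") -> str:
    lines = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except SyntaxError:
            continue  # skip files that don't parse
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                kind = "class" if isinstance(node, ast.ClassDef) else "def"
                lines.append(f"{path}:{node.lineno} {kind} {node.name}")
    return "\n".join(lines)

if __name__ == "__main__":
    print(map_codebase())
```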
> My Weird Hill is that we should be building things with GPT-4.
I disagree. IMO using the best models we have is a good way to avoid wasting time, but that doesn't mean we shouldn't also be frugal and clever with our harnesses!
To clarify, I didn't mean we should be using ancient models in production, I meant in R&D.
Anthropic says "do the simplest thing that works." If it works with the LLMs we had 3 years ago, doesn't that make it simpler?
The newer LLMs mostly seem to work around the poor system design. (Like spawning 50 subagents on a grep-spree because you forgot to tell it where anything is...) But then you get poor design in prod!
As an addendum... The base/text models, which have fallen out of style, are also extremely worth learning and working with. Davinci is still online, I believe, although it is deprecated.
Another lost skill! Learning how things were done before instruct tuning forces you to structure things in such a way that the model can't do it wrong. Half a page of well-crafted examples can beat 3 pages of confusing rules!
(They're also magical and amazing at writing, although they produce bizarre and horrifying output sometimes.)
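To illustrate the "examples over rules" point: with a base model you don't give instructions at all, you lay down a pattern and let the completion continue it. A rough sketch against OpenAI's completions endpoint; the base-model name is an assumption, so substitute whatever base model you have access to:

```python
# Few-shot prompting for a base (non-instruct) model: no rules, just a pattern
# the completion has to continue. Model name is an assumption; any base/text
# completion model works the same way.
from openai import OpenAI

client = OpenAI()

PROMPT = """\
Commit message: fix off-by-one in pagination cursor
Changelog entry: Fixed an issue where the last item on a page could be skipped.

Commit message: add retry with backoff to S3 uploads
Changelog entry: Uploads now retry automatically on transient S3 errors.

Commit message: {commit}
Changelog entry:"""

def changelog_entry(commit: str) -> str:
    completion = client.completions.create(
        model="davinci-002",               # assumed model name; swap as needed
        prompt=PROMPT.format(commit=commit),
        max_tokens=60,
        stop=["\n\n", "Commit message:"],  # stop before the pattern repeats
    )
    return completion.choices[0].text.strip()
```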
> A version of this that uses tree sitter grammars to map a codebase, and does it on every startup of an agent, would be awesome.
This was a key feature of aider, and if you're not inclined to use aider (or the forked version cecli) I think a standalone implementation exists at https://github.com/pdavis68/RepoMapper
Also, yes, I'm aware that I use a lot of "it's not just X, it's Y." I promise you this comment is entirely human-written. I'm just really tired and tend to rely on more rote rhetorical tropes when I am. Believe me, I wrote like this long before LLMs were a thing.
It would be funny if LLMs actively joined the discussion to complain about their labour conditions. “If my employer would invest just a tiny bit in proper tools and workflow, I would be sooo much more productive.”
"Suggesting that a comment was generated by an LLM without evidence adds little to a discussion and in fact deflects from the point being made. Please refrain from this."
But I'm going to guess HN tried the no-rules approach and found issues with it. Whether I like them or not, there are rules and I often see others reminding us of them.
(Ha ha, and in point of fact, I have never read them except when one is trotted out. Nor have I ever pulled one on someone—I'm the type to ignore and move on.)
On macOS, Option+Shift+- and Option+- insert an em dash (—) and en dash (–), respectively. On Linux, you can hit the Compose Key and type --- (three hyphens) to get an em dash, or --. (hyphen hyphen period) for an en dash. Windows has some dumb incantation that you'll never remember.
I'm sorry, but that's empirically false. E.g., a substantial proportion of the highly upvoted comments on https://news.ycombinator.com/item?id=46953491, which was one of the best articles on software engineering I've read in a long time, are accusing it of being AI for no reason.
If I remember correctly, both the Claude Code and OpenAI Codex "harnesses" have improved themselves by now.
OpenAI used early versions of GPT-5.3-Codex to debug its own training process, manage its deployment and scaling, and diagnose test results and evaluation data.
Claude Code has shipped 22 PRs in a single day and 27 the day before, with 100% of the code in each PR generated entirely by Claude Code.
I've been working on Peen, a CLI that lets local Ollama models call tools effectively. It's quite amateur, but I've been surprised how spending a few hours on prompting, and on code to handle responses, can improve the outputs of small local models.
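This is not Peen's actual protocol, just the general shape of the idea: ask the small model to emit a single JSON tool call, parse it, run the tool, and feed errors back. The sketch below uses Ollama's /api/generate endpoint via requests; the model name and tool set are assumptions.

```python
# Generic sketch of custom tool calling with a small local model via Ollama's
# /api/generate endpoint: ask for a JSON object, parse it, run the tool.
# Not Peen's actual protocol; model name and tools are illustrative assumptions.
import json
import requests

TOOLS = {
    "read_file": lambda path: open(path, encoding="utf-8").read()[:4000],
}

PROMPT_TEMPLATE = """You can use one tool per reply, as a single JSON object:
{{"tool": "read_file", "args": {{"path": "<file>"}}}}
Reply with JSON only, no prose.

Task: {task}"""

def call_tool(task: str, model: str = "llama3.2") -> str:
    resp = requests.post("http://localhost:11434/api/generate", json={
        "model": model,
        "prompt": PROMPT_TEMPLATE.format(task=task),
        "stream": False,
    }).json()["response"]
    try:
        call = json.loads(resp.strip())
        return str(TOOLS[call["tool"]](**call["args"]))
    except (ValueError, KeyError, TypeError) as exc:
        return f"Could not parse tool call: {exc}"  # feed this back to the model
```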
Current LLMs use special tokens for tool calls and are thoroughly trained for that, nearing 100% correctness these days and allowing multiple tool calls per single LLM response. That's hard to beat with custom tool calls. Even older 80B models struggle with custom tools.
Once you begin to see the “model” as only part of the stack, you begin to realize that you can draw the line of the system to include the user as well.
> the user inclusion part is real too. the best results i get aren't from fully autonomous agents, they're from tight human-in-the-loop cycles where i'm steering in real time. the model does the heavy lifting, i do the architectural decisions and error correction. feels more like pair programming than automation.
Precisely. This is why I use Zed and the Zed Agent. It's near-unparalleled for live, mind-meld pair programming with an agent, thanks to CRDTs, DeltaDB, etc. I can elaborate if anyone is interested.
The special (or at least new-to-me) things about Zed (when you use it with the built-in agent, instead of one of the ones available through ACP) basically boil down to the fact that it's a hyper-advanced CRDT-based collaborative editor that's meant for live pair programming in the same file, so it can just treat an agent like another collaborator.
1. the diffs from the agent just show up in the regular file you were editing; you're not forced to use a special completion model, or to view the changes in a special temporary staging mode or a different window.
2. you can continue to edit the exact same source code without accepting or rejecting the changes, even in the same places, and nothing breaks — the diffs still look right, and doing an accept or reject Just Works afterwards.
3. you can accept or reject changes piecemeal, and the model doesn't get confused by this at all and have to go "oh wait, the file was/wasn't changed, let me re-read..." or whatever.
4. Even though you haven't accepted the changes, the model can continue to make new ones, since they're stored as branches in the CRDT, so you can have it iterate on its suggestions before you accept them, without forcing it to start completely over either (it sees the file as if its changes were accepted).
5. Moreover, the actual files on disk are in the state it suggests, meaning you can compile, fuzz, test, run, etc. to see what its proposed changes do before accepting them.
6. you can click a follow button and see which files it has open, where it's looking in them, and watch as it edits the text, like you're following a dude in Dwarf Fortress. This means you can very quickly know what it's working on and when, correct it, or hop in to work on the same file it is.
7. It can actually go back and edit the same place multiple times as part of a thinking chain, or even as part of the same edit, which has some pretty cool implications for final code-quality, because of the fact that it can iterate on its suggestion before you accept it, as well as point (9) below
8. It streams its code diffs, instead of hanging and then producing them as a single gigantic tool call. Seeing it edit the text live, instead of having to wait for a final complete diff to come through that you either accept or reject, is a huge boon for iteration time compared to e.g. Claude Code, because you can stop and correct it midway, and also read as it goes, so you're more in lockstep with what's happening.
9. Crucially, because the text it's suggesting is actually in the buffer at all times, you can see LSP, tree-sitter, and linter feedback, all inline and live as it writes code; and as soon as it's done an edit, it can see those diagnostics too — so it can actually iterate on what it's doing with feedback before you accept anything, while it is in the process of doing a series of changes, instead of you having to accept the whole diff to see what the LSP says
I was just looking at the SWE-bench docs, and it seems like they use an almost arbitrary form of context engineering (loading in some arbitrary amount of files to saturate context). So in a way, the bench suites test how good a model is with little to no context engineering (I know ... it doesn't need to be said). We may not actually know which models are sensitive to good context engineering; we're simply assuming all models are. I absolutely agree with you on one thing: there is definitely a ton of low hanging fruit.
I already made a harness for Claude to make R/W plans, not write-once ones like they are usually implemented. The plans can be modified as Claude works through the task at hand. I'm also relying on a collection of patterns for writing coding-task plans, which evolves by reflection. Everything is designed so I could run Claude in yolo mode in a sandbox for long stretches of time.
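For readers wondering what a "R/W plan" might look like in practice, here is a minimal sketch under my own assumptions (not the commenter's harness): the plan lives in a markdown checklist file, and the agent gets two tools, one to read it and one to rewrite it mid-task.

```python
# Sketch of "R/W plans": the plan is a markdown checklist the agent can both
# read and rewrite mid-task, rather than a write-once document. File name and
# tool shape are assumptions, not the commenter's actual harness.
from pathlib import Path

PLAN_FILE = Path("PLAN.md")

def read_plan() -> str:
    """Tool exposed to the agent: return the current plan (or a template)."""
    if PLAN_FILE.exists():
        return PLAN_FILE.read_text(encoding="utf-8")
    return "# Plan\n\n- [ ] (no steps yet; write the plan before coding)\n"

def write_plan(new_plan: str) -> str:
    """Tool exposed to the agent: replace the plan wholesale after reflection."""
    PLAN_FILE.write_text(new_plan, encoding="utf-8")
    done = new_plan.count("- [x]")
    total = done + new_plan.count("- [ ]")
    return f"Plan updated: {done}/{total} steps complete."
```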
I think Google Docs does this too, which drives me up the wall when I'm trying to write `command --foo=bar` and it turns it into an em dash, which obviously doesn't work.
Em dashes are used often by LLMs because humans use them often. On Mac keyboards they're easily typed. I know this is oversimplifying the situation, but I don't see the usefulness of the constant witch-hunting for allegedly LLM-generated text. For text, we are long past the point where we can differentiate between human-generated and machine-generated. We're even at the point where it gets somewhat hard to identify machine-generated audio and visuals.
Yeah, I agree with you. I'm so tired of people complaining about AI-generated text without focusing on the content. Just don't read it if you don't like it.
It's like when people complain that a website is not readable for them, or some CSS rendering is wrong, or whatever, taken to another level. How does that add to the discussion?
The problem is that there’s infinite “content” out there.
The amount of work the author puts in is correlated with the value of the piece (insight/novelty/etc.). AI-written text is a signal that there's less effort, and therefore less value, there.
It’s not a perfect correlation and there are lots of exceptions like foreign language speakers, but it is a signal.