More

mcqueenjordan · 2025-12-07T02:22:22 1765074142

As usual with Oxide's RFDs, I found myself vigorously head-nodding while reading. Somewhat rarely, I found a part that I found myself disagreeing with:

> Unlike prose, however (which really should be handed in a polished form to an LLM to maximize the LLM’s efficacy), LLMs can be quite effective writing code de novo.

Don't the same arguments against using LLMs to write one's prose also apply to code? Was this structure of the code and ideas within the engineers'? Or was it from the LLM? And so on.

Before I'm misunderstood as a LLM minimalist, I want to say that I think they're incredibly good at solving for the blank page syndrome -- just getting a starting point on the page is useful. But I think that the code you actually want to ship is so far from what LLMs write, that I think of it more as a crutch for blank page syndrome than "they're good at writing code de novo".

I'm open to being wrong and want to hear any discussion on the matter. My worry is that this is another one of the "illusion of progress" traps, similar to the one that currently fools people with the prose side of things.

averynicepen · 2025-12-07T03:10:45 1765077045

Writing is an expression of an individual, while code is a tool used to solve a problem or achieve a purpose.

The more examples of different types of problems being solved in similar ways present in an LLM's dataset, the better it gets at solving problems. Generally speaking, if it's a solution that works well, it gets used a lot, so "good solutions" become well represented in the dataset.

Human expression, however, is diverse by definition. The expression of the human experience is the expression of a data point on a statistical field with standard deviations the size of chasms. An expression of the mean (which is what an LLM does) goes against why we care about human expression in the first place. "Interesting" is a value closely paired with "different".

We value diversity of thought in expression, but we value efficiency of problem solving for code.

There is definitely an argument to be made that LLM usage fundamentally restrains an individual from solving unsolved problems. It also doesn't consider the question of "where do we get more data from".

>the code you actually want to ship is so far from what LLMs write

I think this is a fairly common consensus, and my understanding is the reason for this issue is limited context window.

twodave · 2025-12-07T03:28:51 1765078131

I argue that the intent of an engineer is contained coherently across the code of a project. I have yet to get an LLM to pick up on the deeper idioms present in a codebase that help constrain the overall solution towards these more particular patterns. I’m not talking about syntax or style, either. I’m talking about e.g. semantic connections within an object graph, understanding what sort of things belong in the data layer based on how it is intended to be read/written, etc. Even when I point it at a file and say, “Use the patterns you see there, with these small differences and a different target type,” I find that LLMs struggle. Until they can clear that hurdle without requiring me to restructure my entire engineering org they will remain as fancy code completion suggestions, hobby project accelerators, and not much else.

mac-attack · 2025-12-07T15:38:33 1765121913

Very well stated.

lukasb · 2025-12-07T02:25:55 1765074355

One difference is that clichéd prose is bad and clichéd code is generally good.

joshka · 2025-12-07T02:29:01 1765074541

Depends on what your prose is for. If it's for documentation, then prose which matches the expected tone and form of other similar docs would be clichéd in this perspective. I think this is a really good use of LLMs - making docs consistent across a large library / codebase.

danenania · 2025-12-07T02:33:44 1765074824

A problem I’ve found with LLMs for docs is that they are like ten times too wordy. They want to document every path and edge case rather focusing on what really matters.

It can be addressed with prompting, but you have to fight this constantly.

pxc · 2025-12-07T14:24:35 1765117475

> A problem I’ve found with LLMs for docs is that they are like ten times too wordy

This is one of the problems I feel with LLM-generated code, as well. It's almost always between 5x and long and 20x (!) as long as it needs to be. Though in the case of code verbosity, it's usually not because of thoroughness so much as extremely bad style.

bigiain · 2025-12-07T03:12:46 1765077166

I think probably my most common prompt is "Make it shorter. No more than ($x) (words|sentences|paragraphs)."

pxc · 2025-12-07T14:39:52 1765118392

I've never been able to get that to work. LLMs can't count; they don't actually know how long their output is.

minimaxir · 2025-12-07T02:30:47 1765074647

I have been testing agentic coding with Claude 4.5 Opus and the problem is that it's too good at documentation and test cases. It's thorough in a way that it goes out of scope, so I have to edit it down to increase the signal-to-noise.

girvo · 2025-12-07T03:38:37 1765078717

The “change capture”/straight jacket style tests LLMs like to output drive me nuts. But humans write those all the time too so I shouldn’t be that surprised either!

mulmboy · 2025-12-07T08:09:43 1765094983

What do these look like?

pmg101 · 2025-12-07T11:14:30 1765106070

  1. Take every single function, even private ones.
  2. Mock every argument and collaborator.
  3. Call the function.
  4. Assert the mocks were  called in the expected way.

These tests help you find inadvertent changes, yes, but they also create constant noise about changes you intend.

senbrow · 2025-12-07T18:25:01 1765131901

These tests also break encapsulation in many cases because they're not testing the interface contract, they're testing the implementation.

ornornor · 2025-12-07T16:31:11 1765125071

Juniors on one of the teams I work with only write this kind of tests. It’s tiring, and I have to tell them to test the behaviour, not the implementation. And yet every time they do the same thing. Or rather their AI IDE spits these out.

girvo · 2025-12-07T21:53:12 1765144392

You beat me to it, and yep these are exactly it.

“Mock the world then test your mocks”, I’m simply not convinced these have any value at all after my nearly two decades of doing this professionally

diamond559 · 2025-12-07T06:24:40 1765088680

If the goal is to document the code and it gets sidetracked and focuses on only certain parts it failed the test. It just further proves llm's are incapable of grasping meaning and context.

dcre · 2025-12-07T02:33:39 1765074819

Docs also often don’t have anyone’s name on them, in which case they’re already attributed to an unknown composite author.

mcqueenjordan · 2025-12-07T04:46:35 1765082795

I guess to follow up slightly more:

- I think the "if you use another model" rebuttal is becoming like the No True Scotsman of the LLM world. We can get concrete and discuss a specific model if need be.

- If the use case is "generate this function body for me", I agree that that's a pretty good use case. I've specifically seen problematic behavior for the other ways I'm seeing it OFTEN used, which is "write this feature for me", or trying to one shot too much functionality, where the LLM gets to touch data structures, abstractions, interface boundaries, etc.

- To analogize it to writing: They shouldn't/cannot write the whole book, they shouldn't/cannot write the table of contents, they cannot write a chapter, IMO even a paragraph is too much -- but if you write the first sentence and the last sentence of a paragraph, I think the interpolation can be a pretty reasonable starting point. Bringing it back to code for me means: function bodies are OK. Everything else gets questionable fast IME.

IgorPartola · 2025-12-07T06:13:49 1765088029

My suspicion is that this is a form of the paradox where you can recognize that the news being reported is wrong when it is on a subject in which you are an expert but then you move onto the next article on a different subject and your trust resumes.

Basically if you are a software engineer you can very easily judge quality of code. But if you aren’t a writer then maybe it is hard for you to judge the quality of a piece of prose.

knollimar · 2025-12-08T14:45:12 1765205112

Gell-Mann amnesia

AlexCoventry · 2025-12-07T04:30:53 1765081853

> I think that the code you actually want to ship is so far from what LLMs write

It depends on the LLM, I think. A lot of people have a bad impression of them as a result of using cheap or outdated LLMs.

themk · 2025-12-07T04:08:52 1765080532

I recently published an internal memo which covered the same point, but I included code. I feel like you still have a "voice" in code, and it provides important cues to the reviewer. I also consider review to be an important learning and collaboration moment, which becomes difficult with LLM code.

cheeseface · 2025-12-07T10:37:13 1765103833

There are cases where I would start the coding process by copy-pasting existing code (e.g. test suites, new screens in the UI) and this is where LLMs work especially well and produce code that is majority of the time production-ready as-is.

A common prompt I use is approximately ”Write tests for file X, look at Y on how to setup mocks.”

This is probably not ”de novo” and in terms of writing is maybe closer to something like updating a case study powerpoint with the current customer’s data.

dcre · 2025-12-07T02:25:37 1765074337

In my experience, LLMs have been quite capable of producing code I am satisfied with (though of course it depends on the context — I have much lower standards for one-off tools than long-lived apps). They are able to follow conventions already present in a codebase and produce something passable. Whereas with writing prose, I am almost never happy with the feel of what an LLM produces (worth noting that Sonnet and Opus 4.5’s prose may be moving up from disgusting to tolerable). I think of it as prose being higher-dimensional — for a given goal, often the way to express it in code is pretty obvious, and many developers would do essentially the same thing. Not so for prose.

make_it_sure · 2025-12-07T08:00:54 1765094454

try Opus 4.5, you'll be surprised. It might be true for past versions of LLMs, but they advanced a lot.

mcqueenjordan · 2025-11-25T15:13:32 1764083612

The jankiness of the original had a lot of charm, almost selling the dystopian absurdity of trying to deploy a service via the janky voice and slightly desync'd audio and animation. I don't think it's just nostalgia, because I felt the same way watching it the first time all those years ago.

I think AI slop is decidedly different, because it just doesn't have the charm. I don't know if I can yet decompose exactly why that is.

ge96 · 2025-11-25T16:27:44 1764088064

https://www.youtube.com/watch?v=3t6L-FlfeaI

mcqueenjordan · 2025-10-02T12:32:12 1759408332

Presumably cargo clippy --fix was the intention. Not all things are fixable, though, which is where LLMs are reasonable for -- the squishy hard-to-autofix things.

mcqueenjordan · 2025-10-02T11:37:13 1759405033

If you want to do great work, that usually happens in environments with minimized politics.

It's probably bad career advice to completely avoid politics (most places aren't doing great work) but it depends on what you're optimizing for.

The problem with everyone getting into the political game is that then we have everyone talking and noone building.

mcqueenjordan · 2025-06-18T02:13:32 1750212812

One of my favorite LLM uses is to feed it this essay, then ask it to assume the persona of the grug-brained developer and comment on $ISSUE_IM_CURRENTLY_DEALING_WITH. Good stress relief.

CactusRocket · 2025-06-18T13:45:55 1750254355

I am not very proficient with LLMs yet, but this sounds awesome! How do you do that, to "feed it this essay"? Do you just start the prompt with something like "Act like the Grug Brained Developer from this essay <url>"?

rm_-rf_slash · 2025-06-18T16:59:12 1750265952

Could put it in a ChatGPT project description or Cursor rules to avoid copy pasting every time.

mcqueenjordan · 2025-06-16T03:49:26 1750045766

I haven't read all the comments and I'm sure someone else made a similar point, but my first thought was the flip the direction of the statement: "Waymo rides cost more than Uber or Lyft /because/ people are willing to pay more".

mcqueenjordan · 2025-03-20T05:10:29 1742447429

Usually if you’re using it, it’s because you’re forced to.

In my experience, the best strategy is to minimize your use of it — call out to binaries or shell scripts and minimize your dependence on any of the GHA world. Makes it easier to test locally too.

sepositus · 2025-03-20T05:21:22 1742448082

This is what I do. I've written 90% of the logic into a Go binary and GitHub Actions just calls out to it at certain steps. It basically just leaves GHA doing the only thing it's decent at...providing a local UI for pipelines. The best part is you get unit tests, can dogfood the tool in its own pipeline, and can run stuff locally (by just having the CLI nearby).

noisy_boy · 2025-03-20T05:27:20 1742448440

Makes migrations easier too; better to let gitHub or gitlab etc to just be the platform to host source code and trigger events which you decide how to deal with. Your CI itself should be another source controlled repo that provides the features for the application code's thin CI layer to invoke and use. That allows you to be able to run your CI locally in a pretty realistic manner too.

I have done something similar with Jenkins and groovy CI library used by Jenkins pipeline. But it wasn't super simple since a lot of it assumed Jenkins. I wonder if there is a more cleaner open source option that doesn't assume any underlying platform.

raffraffraff · 2025-03-20T09:05:52 1742461552

> Usually if you’re using it, it’s because you’re forced to.

Like teams.

mcqueenjordan · 2025-03-03T06:58:39 1740985119

This is a silly extreme case, but it's kind of an absurd example of what happens when you live a life devoid of the principle of charity[1].

I think tons of interpersonal engineering issues boil down to a failure to apply this principle.

[1]: https://en.wikipedia.org/wiki/Principle_of_charity

mcqueenjordan · 2025-02-15T13:30:56 1739626256

> But I just checked and, unsurprisingly, 4o seems to do reasonably well at generating Semgrep rules? Like: I have no idea if this rule is actually any good. But it looks like a Semgrep rule?

This is the thing with LLMs. When you’re not an expert, the output always looks incredible.

It’s similar to the fluency paradox — if you’re not native in a language, anyone you hear speak it at a higher level than yourself appears to be fluent to you. Even if for example they’re actually just a beginner.

The problem with LLMs is that they’re very good at appearing to speak “a language” at a higher level than you, even if they totally aren’t.

tptacek · 2025-02-15T17:27:32 1739640452

Hold on, hold on. You're missing a step here.

I agree completely that an LLM's first attempt to write a Semgrep rule is likely as not to be horseshit. That's true of everything an LLM generates. But I'm talking about closed-loop LLM code generation. Unlike legal arguments and medical diagnoses, you can hook an LLM up to an execution environment and let it see what happens when the code it generates runs. It then iterates, until it has something that works.

Which, when you think about it, is how a lot of human-generated code gets written too.

So my thesis here does not depend on LLMs getting things right the first time, or without assistance.

bambax · 2025-02-15T20:45:28 1739652328

The problem is what one means by "works". Is it just that it runs without triggering exceptions here and there?

One has to know, and understand, what the code is supposed to be doing, to evaluate it. Or use tests.

But LLMs love to lie so they can't be trusted to write the tests, or even to report how the code they wrote passed the tests.

In my experience the way to use LLMs for coding is exactly the opposite: the user should already have very good knowledge of the problem domain as well as the language used, and just needs to have a conversation with someone on how to approach a specific implementation detail (or help with an obscure syntax quirk). Then LLMs can be very useful.

But having them directly output code for things one doesn't know, in a language one doesn't know either, hoping they will magically solve the problem by iterating in "closed loops", will result in chaos.

tptacek · 2025-02-15T21:58:20 1739656700

It clearly does not result in chaos. This is an "I believe my lying eyes" situation, where I can just see that I can get an agent-y LLM codegen setup to generate a sane-looking working app in a language I'm not fluent in.

The thing everyone thinks about with LLM codegen is hallucination. The biggest problem for LLMs with hallucination is that there are no guardrails; it can just say whatever. But an execution environment provides a ground truth: code works or it doesn't, a handler path generates an exception or it doesn't, a lint rule either compiles and generates workable output or it doesn't.

bambax · 2025-02-16T08:52:49 1739695969

> code works or it doesn't

It seems you're deliberately confusing "works" with "runs". They're different things.

danielbln · 2025-02-15T18:15:41 1739643341

That's also the problem with these conversations. Some people evaluate zero-shot promoted code oozing out of gpt-3.5, others plug Sonnet into an IDE with access to terminal, LSP, diagnostics etc. crunching through a problem in an agentic self improvement loop. Those two approaches will generate very different quality levels of code.

vlovich123 · 2025-02-15T20:52:17 1739652737

An LLM though doesn’t truly understand the goal AND it frequently gets into circular loops it can’t get out of when the solution escapes its capability rather than asking for help. Hopefully it’ll get fixed but some of this stuff is an architectural problem rather than just iterating on the transformer idea.

tptacek · 2025-02-15T22:00:24 1739656824

That's totally true, but it's also a small amount of Python code in the agent scaffolding to ensure that it bails on those kinds of loops. Meanwhile, for something like Semgrep, the status quo ante was essentially no Semgrep rules getting written at all (I believe the modal Semgrep user just subscribes to existing rule repositories). If a closed-loop LLM setup can successfully generate Semgrep rules for bug patterns even 5% of the time, that is a material win, and a win that comes at very little cost.

mcqueenjordan · 2025-02-16T02:05:55 1739671555

Yeah, I more or less agree about the closed loop part and the overall broader point the article was making in this context — that it may be a useful use case. I think it’s likely that process creates a lot of horseshit that passes through the process, but that might still be better than nothing for semgrep rules.

I only came down hard on that quote out of context because it felt somewhat standalone and I want to broadcast this “fluency paradox” point a bit louder because I keep running into people who really need to hear it.

I know you know what’s up.

mcqueenjordan · on Nov 26, 2024

Reliability is hard when your volume is (presumably) scaling geometrically.

paxys · on Nov 26, 2024

Can't use the "reliability is hard" excuse when you are quite literally in the business of selling reliability.

mcqueenjordan · on Nov 26, 2024

It’s just not that big of a mystery. It’s not an excuse; it’s just true. Also, they’re not especially selling reliability as much as they’re selling small geo-distributed deployments.