1. Take every single function, even private ones.
2. Mock every argument and collaborator.
3. Call the function.
4. Assert the mocks were called in the expected way.
These tests help you find inadvertent changes, yes, but they also create constant noise about changes you intend.
Juniors on one of the teams I work with only write this kind of test. It's tiring, and I have to tell them to test the behaviour, not the implementation. And yet every time they do the same thing. Or rather, their AI IDE spits these out.
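To make the contrast concrete, here's a minimal Python sketch (the `apply_discount` function and its collaborator are hypothetical): the first test pins down *how* the function talks to its mock and breaks on any refactor, while the second asserts only the observable result.

```python
import unittest
from unittest import mock


def apply_discount(price, discount_service):
    """Hypothetical function under test: returns the price minus a discount."""
    rate = discount_service.get_rate(price)
    return price * (1 - rate)


class ImplementationCoupledTest(unittest.TestCase):
    def test_calls_collaborator(self):
        # The mock-heavy pattern: assert on how the work was done.
        service = mock.Mock()
        service.get_rate.return_value = 0.1
        apply_discount(100, service)
        service.get_rate.assert_called_once_with(100)  # breaks on any refactor


class BehaviourTest(unittest.TestCase):
    def test_discount_is_applied(self):
        # The behaviour: a 10% rate on 100 gives 90, however it's computed.
        service = mock.Mock()
        service.get_rate.return_value = 0.1
        self.assertAlmostEqual(apply_discount(100, service), 90.0)


if __name__ == "__main__":
    unittest.main()
```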
> Everything it does can be done reasonably well with list comprehensions and objects that support type annotations and runtime type checking (if needed).
I see this take fairly often, and usually with a similar lack of nuance. How do you come to this? In other cases where I've seen it, it's come from people who haven't worked in any context where performance or interoperability with the scientific computing ecosystem matters, which misses a massive part of the picture. I've struggled to get through to them before. Genuine question.
It does largely avoid the issue if you configure the secret to be available only to specific environments AND require reviews before pushing/merging to the branches that can deploy to that environment.
Yes, and anyone who knows anything about software dev knows that the first thing you should do with an important repo is set up branch protections to disallow that, require reviews, etc. Basic CI/CD.
This incident reflects extremely poorly on PostHog because it demonstrates a lack of thought about security beyond the surface level. It tells us that any dev at PostHog can publish packages at any time, without review (because we know the secret to do this is stored as a plain GHA secret, which can be read from any GHA run, and those runs presumably happen on any internal dev's PR). The most charitable interpretation is that they consciously accept this because it reduces friction, in which case I'd say that demonstrates poor judgement and a bad balance.
A casual audit would have revealed this and suggested something like restricting the secret to a specific GHA environment and requiring reviews to push to that environment.
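For illustration, here's a minimal sketch of that setup, assuming an npm package published from GitHub Actions. The workflow name, the "release" environment, and the NPM_TOKEN secret name are all hypothetical, and the required-reviewers rule itself lives in the repo's Settings → Environments, not in the YAML:

```yaml
name: publish-package
on:
  push:
    tags: ["v*"]

jobs:
  publish:
    runs-on: ubuntu-latest
    # "release" is a hypothetical environment configured with required
    # reviewers; the job (and the environment-scoped secret) is only
    # released after someone approves the deployment.
    environment: release
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          registry-url: "https://registry.npmjs.org"
      - run: npm publish
        env:
          # Stored as an environment secret rather than a repo-wide secret,
          # so arbitrary workflow runs on dev branches can't read it.
          NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }}
```

With the token scoped to that environment, a random run on a dev's PR never sees it, and the publish job pauses until a designated reviewer approves.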
I've found structured output APIs to be a pain across various LLMs. Now I just ask for JSON output and pull it out between the first and last curly brace. If validation fails, I retry with details about why it was invalid. This works very reliably for complex schemas, and it works across all LLMs without having to think about each provider's limitations.
And then you can add complex pydantic validators (or whatever, I use pydantic) with super helpful error messages that get fed back to the model on retry. Powerful pattern.
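As a rough sketch of the pattern (the `Invoice` schema and the `call_llm` placeholder are hypothetical, not any particular provider's API):

```python
from pydantic import BaseModel, Field, ValidationError


class Invoice(BaseModel):
    """Hypothetical target schema."""
    vendor: str
    total_usd: float = Field(gt=0)


def call_llm(prompt: str) -> str:
    raise NotImplementedError  # your LLM client of choice goes here


def extract(prompt: str, max_retries: int = 3) -> Invoice:
    for _ in range(max_retries):
        raw = call_llm(prompt)
        # Ignore any preamble/markdown: keep only first..last curly brace.
        candidate = raw[raw.find("{"): raw.rfind("}") + 1]
        try:
            return Invoice.model_validate_json(candidate)
        except (ValidationError, ValueError) as err:
            # Feed the validation errors back so the model can correct itself.
            prompt += f"\n\nYour last reply was invalid:\n{err}\nReturn corrected JSON only."
    raise RuntimeError("LLM never produced valid JSON")
```

The nice part is that pydantic's error messages name the exact field and constraint that failed, which is usually enough for the model to fix it on the next attempt.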
Significant; check any Claude-related thread here over the last month, or the Claude Code subreddit. Anecdotally, the degradation has been so bad that I had to downgrade to a month-old version, which has helped a lot. I think part of the problem is there as well (Claude Code itself).
We operate a SaaS where a common step is inputting rates of widgets in $/widget, $/widget/day, $/1kwidgets, etc. These are incredibly tedious and error-prone to enter. And usually the source of these rates is an invoice that presents them in ambiguous ways, e.g. rows with "quantity" and "charge" from which you have to back-calculate the rate. And these invoices are formatted in all sorts of different ways.
We offer a feature where you upload the invoice and we pull out all the rates for you. It uses LLMs under the hood. Fundamentally it's a "ChatGPT wrapper", but there's a massive amount of work in tweaking the prompts based on evals, splitting things up into multiple calls, etc.
And it works great! Niche software, but for power users we're saving tens of minutes of monotonous work per day and, in all likelihood, entering things more accurately. It complements the manual entry process, with full ability to review the results. Accuracy is around 98-99 percent.
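As an illustration of where those validators earn their keep in a case like this, here's a hypothetical line-item schema that cross-checks the extracted rate against charge / quantity, so unit mix-ups (e.g. $/1kwidgets) get bounced back to the model on retry instead of landing in front of the user:

```python
from pydantic import BaseModel, Field, model_validator


class RateLine(BaseModel):
    """Hypothetical shape of one extracted invoice line."""
    description: str
    quantity: float = Field(gt=0)
    charge_usd: float = Field(ge=0)
    rate_usd_per_widget: float = Field(ge=0)

    @model_validator(mode="after")
    def rate_matches_charge(self) -> "RateLine":
        # A mismatch between the extracted rate and charge / quantity usually
        # means the model grabbed the wrong column or missed a unit scale;
        # the error message is fed back to the model on the retry.
        implied = self.charge_usd / self.quantity
        if abs(implied - self.rate_usd_per_widget) > 0.01 * max(implied, 1e-9):
            raise ValueError(
                f"rate {self.rate_usd_per_widget} disagrees with "
                f"charge/quantity = {implied:.4f}; re-check units (e.g. $/1kwidgets)"
            )
        return self
```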
I gave it a shot just now with a fairly simple refactor: +19 lines, -9 lines, across two files. Totally ballsed it up. It defined only one of the two variables it was meant to, then referred to the one it hadn't implemented. I told it "hey, you forgot the second variable" and it went and added it in twice. The comments it added (after I prompted it to) were half-baked and ambiguous when read in context.
Never had anything like this with Claude Code.
I've used Gemini 2.5 Pro quite a lot, and like most people I find it very intelligent. I've bent over backwards to use it in another piece of work because it's so good. I can only assume it's the Gemini CLI itself that's using the model poorly. Keen to try again in a month or two and see whether this poor first impression is just a teething issue.
I told it that it did a pretty poor job and asked it why it thinks that is, adding that I know it's pretty smart. It gave me a wall of text, so I asked for the short summary:
> My tools operate on raw text, not the code's structure, making my edits brittle and prone to error if the text patterns aren't perfect. I lack a persistent, holistic view of the code like an IDE provides, so I can lose track of changes during multi-step tasks. This led me to make simple mistakes like forgetting a calculation and duplicating code.
I noticed a significant degradation in Gemini's coding abilities over the last couple of checkpoints of 2.5. The benchmarks say it should be better, but that doesn't jibe with my personal experience.