As always, this requires nuance. Just yesterday and today, I did exactly that with my direct reports (I'm director-level). We had gotten a bug report, and the team had collectively looked into it and believed it was not our problem but an external vendor's. We reported it to the vendor, who looked into it, tested it, and then pushed back and said it was our problem. My team is still more LLM-averse than I am, so I had Codex look at it; it believed it had found the problem and prepared the PR. I did not review or test the PR myself, but instead assigned it to the team to validate, partly as a learning exercise. They looked it over and agreed it was a valid fix for a problem on our side. I believe that process was better than me just fully validating it myself, and part of the process of encouraging them to use LLMs as a tool for their work.
> I believe that process was better than me just fully validating it myself
Why?
> and part of the process toward encouraging them to use LLM as a tool for their work.
Did you look at it from their perspective? You set the exact opposite example and serve as a perfect example for TFA: you did not deliver code you have proven to work. I imagine some would find this demoralizing.
I've worked with a lot of director-level software folk and many would just do the work. If they're not going to do the work, then they should probably assign someone to do it.
What if it didn't work? What if you just wasted a bunch of engineering time reviewing slop? I don't comprehend this mindset. If you're supposedly a leader, then lead.
Two decades ago, so well before any LLMs, our CEO did that with a couple of huge code changes: he hacked together a few things and threw them over the wall to us (10K lines). I was happy I did not get assigned to deal with that mess, but getting it into production-quality code took more than a month!
"But I did it in a few days, how can it take so long for you guys?" was not received well by the team.
Sure, every case is different, and maybe here it made sense if the fix was small and testing it was simple. Personally (also in a director-level role today), I'd rather lead by example and do the full story, including testing, and especially writing automated tests (with an LLM's help or not), particularly if the change is small. I actually did that ~12 months ago to fix misuse of mutexes in one of our platform libraries, when everybody else was stuck because our multi-threaded code behaved like single-threaded code.
Even so, I prefer to sit with them and loudly ask the questions I'd be asking myself on the path to a fix: letting them see how I get to a solution is even more valuable, IMO.
So, I've been playing with an MCP server of my own... the API the MCP server talks to is something that can create/edit/delete argument structures, like argument graphs - premises, lemmas, and conclusions. The server has a good syntactical understanding of arguments, how to structure syllogisms, etc.
But it doesn't have a semantic understanding, because it's not an LLM.
So connecting an LLM with my API via MCP means that I can do things like "can you semantically analyze the argument?" and "can you create any counterpoints you think make sense?" and "I don't think premise P12 is essential for lemma L23, can you remove it?" And it will, and I can watch it on my frontend to see how the argument evolves.
So in that sense - combining semantic understanding with tool use to do something that neither can do alone - I find it very valuable. However, if your point is that something other than MCP can do the same thing, I could probably accept that too (especially if you suggested what that could be :) ). I've considered just having my backend use an API key to call models, but that's sort of a different pattern that would require me to write a whole lot more code (and pay more money).
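To make that concrete, the MCP glue is pretty thin. Here's a minimal TypeScript sketch against the official SDK; the tool name and the backend endpoint below are simplified stand-ins for illustration, not my actual API:

```ts
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "argument-graph", version: "0.1.0" });

// One tool per graph operation: the LLM supplies the semantics,
// the backend enforces the syntactic structure of the argument graph.
server.tool(
  "remove_premise",
  "Detach a premise from a lemma if it is not structurally required",
  { premiseId: z.string(), lemmaId: z.string() },
  async ({ premiseId, lemmaId }) => {
    // Hypothetical backend endpoint; the real API would do the structural validation here.
    const res = await fetch(
      `http://localhost:8080/lemmas/${lemmaId}/premises/${premiseId}`,
      { method: "DELETE" }
    );
    return {
      content: [
        { type: "text", text: res.ok ? "removed" : `rejected: ${await res.text()}` },
      ],
    };
  }
);

await server.connect(new StdioServerTransport());
```

The nice part is that the frontend just watches the same backend, so every tool call the model makes shows up live as the argument graph changes.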
Is there some niche "endian" humor about this :) - e.g., is this the little-endian, big-endian, or "middle-endian" version of "Rock Paper Scissors"? Excuse my really poor attempt at this.
That rings a bell, as in I think I had some classmates as a kid who called it that, and I remember thinking they were weird. I'd guess it was maybe 1:3 in favor of RPS over RSP.
I've been dealing with Codex CLI for a while and I love it, but I'm wondering if my thinking is just limited. While I do start discussions and create plan docs, I've never been able to ask it to do anything that takes it longer than 25 minutes or so - usually far less. I'm having trouble imagining what I could ask it to do that would make it take hours - wouldn't that require an absolutely massive planning doc that would itself take hours to put together? I'd rather just move incrementally.
Perhaps they're combining an incredibly complex product that has a lot of interactive features, a big codebase, test creation, and maybe throwing some MCP stuff in there, such as creating a ticket in Jira if a test fails?
An easy way to get an agent to run a long time is just to get it to babysit CI/CD and tell it to iterate until it passes. I got Sonnet 4 to run for >6 hours that way.
The idea of giving it a task that may take six hours and reviewing it also gives me shivers.
I'm a very happy Codex customer, but everything turns to disgusting slop if I don't provide:
(1) Up-to-date AGENTS.md and an excellent prompt
(2) A full file-level API with function signatures, return types, and function-level guidance if it's a complex one (a rough sketch of what I mean follows this list)
(3) Multiple rounds of feedback until the result is finely sculpted
Overall it's very small units of work - one file or two, tops.
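For example, by "full file-level API" I mean handing it a declaration-style stub like this up front (a hypothetical sketch - the file and function names here are invented for illustration):

```ts
// retry-policy.d.ts - hypothetical stub handed to Codex before it writes any bodies.

/** Outcome of classifying an HTTP response status. */
export type RetryDecision = "retry" | "fail" | "succeed";

/** Pure function, no I/O: 5xx and 429 mean "retry", other 4xx mean "fail", 2xx mean "succeed". */
export function classifyStatus(status: number): RetryDecision;

/**
 * Backoff delay in milliseconds for a 0-based attempt number.
 * Must be capped at maxDelayMs, must add jitter, must never throw.
 */
export function backoffDelayMs(attempt: number, baseDelayMs: number, maxDelayMs: number): number;
```

Then the prompt is mostly "fill in these bodies exactly as declared", which keeps the unit of work small enough to actually review.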
I've been letting the above standards go for the last couple of weeks due to crunch, and looking at some of the hotspots of slop now lying around has me going all Homelander-face [1] at the sight of them.
Those hotspots are a few hundred lines in the worst cases; I'm definitely not ready to deal with the fallout of any unit of work that takes even more than 20min.
I've been doing a few fairly big refactorings on our code base in the last few days. It does a decent job and I generally don't put a lot of effort in my prompts.
It seems to pick up a lot from my code base. I do have an AGENTS.md with some basics on how to run stuff and what to do; that seems to keep it from going off on a wild goose chase trying to figure out how to run things by doing the wrong things.
I think the journey from first using Codex around July to now has been quite something; it has improved a lot. It actually seems to do well in larger code bases where it has a lot of existing structure and examples of how things are done. A lot of things it just does without me asking, because there's a lot of other code that does it that way.
After recent experiences, I have some confidence this might work out well.
Wonder how it compares with elkjs? I had written a few things that wrapped dot, which felt clunky, but have since had good success writing React components that use elkjs. Elkjs has a stupid number of configuration options, some of which aren't very clearly documented, but I've been able to get decent output for use cases like argument mapping and choose-your-own-adventure maps.
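For anyone curious, the core of an elkjs integration is small; something like the sketch below (the layout options shown are just the ones I tend to reach for, and the node ids/sizes are placeholders):

```ts
import ELK from "elkjs/lib/elk.bundled.js";

const elk = new ELK();

// ELK only computes positions, so nodes need explicit sizes up front.
const graph = {
  id: "root",
  layoutOptions: {
    "elk.algorithm": "layered",
    "elk.direction": "DOWN",
    "elk.spacing.nodeNode": "40",
  },
  children: [
    { id: "premise-1", width: 180, height: 60 },
    { id: "conclusion-1", width: 180, height: 60 },
  ],
  edges: [{ id: "e1", sources: ["premise-1"], targets: ["conclusion-1"] }],
};

const laidOut = await elk.layout(graph);
// laidOut.children[i].x / .y (and the edge sections) feed straight into
// absolutely-positioned React nodes or SVG paths.
console.log(laidOut.children?.map((n) => ({ id: n.id, x: n.x, y: n.y })));
```

Most of the effort is in tuning those layoutOptions for a given kind of graph, not in the plumbing.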
I went to a bar a few years ago and there was someone set up at a table in the corner with a weird-looking deck of cards and a sign that said, "help me test my game design!" That particular approach worked really well, but I think particularly because playing a game in a bar is fun.
Data point: I run a site where users submit a record. There was a request months ago to allow users to edit the record after submitting. I put it off because, while it's an established pattern, it touches a lot of things, and I found it annoying busywork and thus low priority. Then gpt5-codex came out and allowed me to use it in Codex CLI with my existing member account. I asked it to support editing for that feature all the way through the backend, with a pleasing UI that fit my theme. It one-shotted it in about ten minutes. I asked for one UI adjustment that I decided I liked better, another five minutes, and I reviewed and released it to prod within an hour. So, you know, months versus an hour.
He's referring to the reality that AI helps you pick up and finish tasks that you otherwise would have put off. I see this all day every day with my side projects as well as security and customer escalations that come into my team. It's not that Giant Project X was done six times as fast. It's more like we were able to do six small but consequential bug fixes and security updates while we continued to push on the large feature.
“If you make a machine that can wash the dishes in an hour, is that more productive than not doing the dishes for months?” - Yes! That’s the definition of productivity!! The dishes are getting done now and they weren’t before! lol
Too bad it's misogynistic; I'm not sure whether you already knew that. If I were rude enough to call you a name, I wonder what term I could use that would work either way!
For those who don't know, OpenAI Codex CLI will now work with your ChatGPT Plus or Pro account. They barely announced it, but it's on their GitHub page. You don't have to use an API key.
I felt like it was getting somewhere and then it pivoted to the stupid graph thing, which I can't seem to escape. Anyway, I think it'll be really interesting to see how this settles out over the next few weeks, and how that'll contrast with what the 24-hour response has been.
My own very naive and underinformed sense: OpenAI doesn't have other revenue paths to fall back on like Google does. The GPT5 strategy really makes sense to me if I look at it as a market-share strategy. They want to scale out like crazy, in a way that is affordable to them. If it's that cheap, then they must have put a ton of work into some scaling effort that the other vendors just don't care about as much, whether due to loss-leader economics or VC funding. It really makes me wonder if OpenAI is sitting on something much better that also just happens to be much, much more expensive.
Overall, I'm weirdly impressed, because if that was really their move here, it's a small piece of evidence that somewhere down in their guts they really do seem to care about their original mission. For people other than power users, this might actually be a big step forward.
I agree that they don't have other revenue paths and think that's a big issue for them. I disagree that this means they care about their original mission though; I think they're struggling to replicate their original insane step function model improvements and other players have caught up.