For anyone getting wrong answers from reasoning models, try adding "This might be a trick question, don't just go with your first instinct, really think it through" and see if it helps. Some time ago I found that this helped reasoning models get trick questions right. (For example, I remember asking the models "two padlocks are locked together, how many of them do I need to open to get them apart," and they confidently answered two. When I added the phrase above, they thought it through more carefully and got the right answer.)
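A minimal sketch of what this looks like in practice; the `ask` callable here is a stand-in for whatever chat API you already use, not a specific library:

    # Append the "slow down" hint to any question before it is sent to a model.
    # `ask` is whatever function you already use to call your chat API.
    HINT = ("This might be a trick question, don't just go with your first "
            "instinct, really think it through.")

    def ask_with_hint(question: str, ask) -> str:
        return ask(f"{question}\n\n{HINT}")

    # Usage (with your own `ask` function):
    # answer = ask_with_hint(
    #     "Two padlocks are locked together, how many of them do I need to "
    #     "open to get them apart?", ask=my_chat_call)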
I agree with this article completely, nice to see it presented quantitatively.
> re "only" the harness changed
In our experience, AIs are like amnesiacs who can barely remember what they did three minutes ago (their last autonomous actions might still be in their context if you're lucky), with no chance of remembering what they did three days ago. As such, the "harness" determines their entire memory and is the single most important determinant of their outcome.
The best harness is a single self-contained, well-commented, obvious, and tiny code file, followed by a plain explanation of what it does and what it's supposed to do, the change request, how you want it done (you have to say it with so much force and confidence that the AI is afraid of getting yelled at if it does anything else), and a large amount of text devoted to asking the AI not to break what is already working. Followed by a request to write a test that passes. Followed by asking for its judgment about whether it broke what was already working or not. All in one tiny, crisp prompt.
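Roughly, the prompt we assemble looks like this. The section wording below is illustrative, not our literal template:

    # Sketch of the single "harness" prompt described above. Every section
    # name and phrase here is an illustrative placeholder.
    def build_prompt(code: str, what_it_does: str, change_request: str, how: str) -> str:
        return "\n\n".join([
            "Here is the entire file. It is small, self-contained, and working:",
            code,
            "What it does and what it is supposed to do:\n" + what_it_does,
            "Change request:\n" + change_request,
            "Do it exactly this way and no other way:\n" + how,
            "Do NOT break anything that already works. Do not refactor, rename, "
            "reorder, or 'improve' anything you were not asked to touch.",
            "Then write a test for the change and make sure it passes.",
            "Finally, state plainly whether you believe you broke anything that "
            "was already working.",
        ])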
With such a harness, it manages not to break the code about one time in twenty. If you use reverse psychology and ask it to do the opposite of what you want, the odds rise to fifty-fifty that you'll get what you're trying to do.
Don't believe me? You can watch the livestream (see my previous comments).
Does this have an alternative market-based solution?
I see a big problem with job scams from "legitimate" companies advertising jobs they have no plans to fill, or which do not exist at all. They seem to do this because they employ HR and recruiters, and it looks good to have job openings on their site. It's a real problem.
I believe there could be a solution: charge $500 to post an opening and give $50 to each of ten AI-vetted qualified candidates. "AI-vetted" means the candidate had a real screening interview and passed it without AI assistance. They get $50 for interviewing.
If the company doesn't hire any of the ten qualified candidates, then they can pay another $500, or stop pretending that the job they're advertising is real.
What do you think of this novel model?
On the candidate side, it allows them to do the vetting interview once and then only be called into a company interview if they're being paid $50 to show up.
The system could also go the other way and prevent candidates from making a living off interviewing, by only inviting them to a limited number of interviews with job offers on the table - say, three such interviews. After that they don't get more invitations, since they're apparently not really on the market.
The AI could also present salary realities more transparently.
Basically, it should be the hiring companies that pay, not the candidates, who currently have a lot of their time wasted.
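As a toy sketch of the money flow being proposed (the numbers are the ones above; the names are made up for illustration):

    # Toy model of the proposed flow: the posting fee is paid by the company
    # and passed through to the vetted candidates.
    POSTING_FEE = 500              # paid by the company per opening
    CANDIDATES_PER_POSTING = 10
    PER_CANDIDATE_PAYOUT = POSTING_FEE // CANDIDATES_PER_POSTING  # $50 each
    MAX_OFFERS_PER_CANDIDATE = 3   # stop inviting a candidate after this many offers

    def cost_to_company(rounds_without_hire: int) -> int:
        """Each round of ten vetted candidates costs another posting fee."""
        return POSTING_FEE * rounds_without_hire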
What do you think of such a market-based approach?
The State of Utopia[1] is currently fine-tuning an older 1 GB model called BitNet, so that we have the beginnings of a sovereign model that can run on the edge. We think model sovereignty is important for our citizens, and we are working on tools for them to easily fine-tune the model further, straight from their browser. We are currently running a 30-hour training run on modest hardware, through WebGPU, so that no trust or installation is required.
We made it possible to run the model in WebGPU, and it is pretty fast even in that environment. You can see the porting process in my last few submissions, where we livestreamed Claude Code porting the base model from the original C++ and Python.
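For context, here is roughly what a BitNet-style linear layer computes (assuming the ternary b1.58 variant): weights are constrained to {-1, 0, +1} with a single scale factor, so the matmul reduces to adds and subtracts. This is just an illustration of the math, not our actual port:

    import numpy as np

    # Rough sketch of a BitNet-style (b1.58, ternary) linear layer: weights
    # are {-1, 0, +1} plus one scale factor, so the matrix multiply is only
    # additions and subtractions. Illustration only, not the ported code.
    def ternary_linear(x, w_ternary, scale):
        # x: (in_features,) activations; w_ternary: (out_features, in_features)
        pos = (w_ternary == 1).astype(x.dtype)
        neg = (w_ternary == -1).astype(x.dtype)
        return scale * (pos @ x - neg @ x)

    # Tiny usage example with random ternary weights:
    w = np.random.choice([-1, 0, 1], size=(4, 8)).astype(np.int8)
    x = np.random.randn(8).astype(np.float32)
    y = ternary_linear(x, w, scale=0.05)   # shape (4,)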
In a separate initiative, we produced a new hash function with AI - however, while it is novel, it may not be novel enough for publication, so it's unclear whether we can publish it. It has several innovations compared to existing hash functions.
We have some other developments and experiments underway, but we don't want to promise more than we can deliver in a working state, so for more information you just have to keep checking stateofutopia.com (or stofut.com for short).
Our biggest challenge at the moment is managing Claude's context usage and versioning while working on live production installs.
Everything takes time and attention, and Claude Code is not even close to being able to build new production services on a server autonomously. We feel that we have to be in the loop for everything.
[1] eventual goal: technocratic utopia, will be available at stateofutopia.com
They still exist. Surprisingly, most folks aren't interested in letting every newsletter and promotion know that they were seen. So a surveillance arms race ensues instead.
In previous installments, Claude Code ported a reference C++ and Python implementation of a neural net to JavaScript and WASM. It next ported it to WebGPU, even though it does not have a GPU in its server instance. The WebGPU code ran, but it produced gibberish. Now, in this livestream, it is trying to debug the WebGPU code by thinking through it carefully and comparing it to the reference implementation - even though it doesn't have a GPU to test its results.
Streaming live now (2:10 Pacific, 5:10 EST on Thursday, February 5, 2026), Claude Code is iteratively improving its own BitNet implementation.[1] It previously successfully ported BitNet from C++ and Python to its own implementation in JavaScript and WASM. (It also ported it to WebGPU; however, that version doesn't work on my computer with my GPU, and it's running on a server without a GPU at the moment, so I'm not sure whether it works.)
Its implementation diverges from the reference implementation, though, even with the same model file, and is qualitatively worse, so now it is iteratively improving its own implementation until it matches the reference.
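For anyone curious how this kind of divergence is usually tracked down: dump intermediate activations from both implementations on the same input and compare them layer by layer until they first disagree. A generic sketch of that idea, not the actual harness from the stream:

    import numpy as np

    # Generic port-vs-reference comparison: walk the intermediate activations
    # in forward-pass order and report the first layer where they diverge.
    def first_divergence(reference_layers, ported_layers, atol=1e-3):
        """Both arguments: lists of (name, np.ndarray) in forward-pass order."""
        for (name, a), (_, b) in zip(reference_layers, ported_layers):
            diff = np.max(np.abs(a.astype(np.float64) - b.astype(np.float64)))
            print(f"{name:>20s}  max |diff| = {diff:.6g}")
            if diff > atol:
                return name  # everything after this point is suspect
        return None  # implementations agree within tolerance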
This livestream is historically significant because Claude Code has been set to code completely autonomously for about ten hours. If you're wondering what it's like to watch Claude Code code in February 2026, this video can give you a good idea. (Most people use it in an interactive session rather than asking it to work autonomously.)
It will be a significant video in ten years, when we can look back at the state of the art in 2026.
Here is its working implementation, which is its baseline and starting point.[2]