I hope better and cheaper models will be widely available because competition is good for the business.
However, I'm more cautious about benchmark claims. MiniMax 2.1 is decent, but one can really not call it smart. The more critical issue is that MiniMax 2 and 2.1 have the strong tendency to reward hacking, often write nonsensical test report while the tests actually failed. And sometimes it changed the existing code base to make its new code "pass", when it actually should fix its own code instead.
Artificial Analysis put MiniMax 2.1 Coding index on 33, far behind frontier models and I feel it's about right. [1]
That's what I found with some of these LLM models as well. For example I still like to test those models with algorithm problems, and sometimes when they can't actually solve the problem, they will start to hardcode the test cases into the algorithm itself.. Even DeepSeek was doing this at some point, and some of the most recent ones still do this.
I have asked GLM4.7 in opencode to make an application to basically filter a couple of spatial datasets hosted at a url I provided it, and instead of trying to download read the dataset, it just read the url, assumed what the datasets were (and got it wrong) is and it's shape (and got it wrong) and the fields (and got it wrong) and just built an application based on vibes that was completely unfixable.
It wrote an extensive test suite on just fake data and then said the app is perfectly working as all tests passed.
This is a model that was supposed to match sonnet 4.5 in benchmarks. I don't think sonnet would be that dumb.
I use LLMs a lot to code, but these chinese models don't match anthropic and openai in being able to decide stuff for themselves. They work well if you give them explicit instructions that leaves little for it to mess up, but we are slowly approaching where OpenAI and anthropic models will make the right decisions on their own
this aligns perfecly with my experience, but of course, the discourse on X and other forums are filled with people who are not hands on. Marketing is first out of the gate. These models are not yet good enough to be put through a long coding session. They are getting better though! GLM 4.7 and Kimi 2.5 are alright.
It really is infuriatingly dumb; like a junior who does not know English. Indeed, it often transitions into Chinese.
Just now it added some stuff to a file starting at L30 and I said "that one line L30 will do remove the rest", it interpreted 'the rest' as the file, and not what it added.
Sounds exactly what a junior-dev would do without proper guidance. Could better direction in the prompts help? I find I frequently have to tell it where to put what fixes. IME they make a lot of spaghetti (LLMs and juniors)
> And sometimes it changed the existing code base to make its new code "pass", when it actually should fix its own code instead.
I haven’t tried MiniMax, but GPT-5.2-Codex has this problem. Yesterday I watched it observe a Python type error (variable declared with explicit incorrect type — fix was trivial), and it added a cast. (“cast” is Python speak for “override typing for this expression”.) I told it to fix it for real and not use cast. So it started sprinkling Any around the program (“Any” is awful Python speak for “don’t even try to understand this value and don’t warn either”).
Even Claude opus 4.6 is pretty willing to start tearing apart my tests or special-case test values if it doesn't find a solution quickly (and in c++/rust land a good proportion of its "patience" seems to be taken up just getting things that compile)
I’ve found that GPT-5.2 is shockingly good at producing code that compiles, despite also being shockingly good at not even trying to compiling it and instead asking me whether I want it to compile the code.
I'm pretty certain that DeepMind (and all other labs) will try their frontier (and even private) models on First Proof [1].
And I wonder how Gemini Deep Think will fare. My guess is that it will get half the way on some problems. But we will have to take an absence as a failure, because nobody wants to publish a negative result, even though it's so important for scientific research.
5 days for Ai is by no mean short! If it can solve it, it would need perhaps 1-2 hours. If it can not, 5 days continuous running would produce gibberish only. We can safely assume that such private models will run inferences entirely on dedicated hardware, sharing with nobody. So if they could not solve the problems, it's not due to any artificial constraint or lack of resources, far from it.
The 5 days window, however, is a sweat spot because it likely prevents cheating by hiring a math PhD and feed the AI with hints and ideas.
That's not really how it works, the recent Erdos proofs in Lean were done by a specialized proprietary model (Aristotle by Harmonic) that's specifically trained for this task. Normal agents are not effective.
Why did you omit the other AI-generated Erdos proofs not done by a proprietary model, which occurred on timescales stretched across significantly longer time than 5 days?
Those were not really "proofs" by the standard of 1stproof. The only way an AI can possibly convince an unsympathetic peer reviewer that its proof is correct is to write it completely in a formal system like Lean. The so-called "proofs" done with GPT were half baked and required significant human input, hints, fixing after the fact etc. which is enough to disqualify them from this effort.
That wasn't my recollection. The individual who generated one of the proofs did a write-up for his methodology and it didn't involve a human correcting the model.
This is exactly the kind of challenge I would want to judge AI systems based on. It required ten bleeding-edge-research mathematicians to publish a problem they've solved but hold back the answer. I appreciate the huge amount of social capital and coordination that must have taken.
Of course it isn't made the front page. If something is promising they hunt it down, and when conquered they post about it. Lot of times the new category has much better results, than the default HN view.
As the focus here is solely on the US, and the comments focus too much on the impossibility of heat dissipation, I want to include some information to broaden the perspective.
- In the EU, the ASCEND study conducted in 2024 by Thales Alenia Space found that data center in space could be possible by 2035. Data center in space could contribute to the EU's Net-Zero goal by 2050 [1]
- heat dissipation could be greatly enhanced with micro droplet technology, and thereby reducing the required radiator surface area by the factor of 5-10
- data center in space could provide advantages for processing space data, instead of sending them all to earth.
- the Lonestar project proved that data storage and edge processing in space (moon, cislunar) is possible.
- A hybrid architecture could dramatically change the heat budget:
+ optical connections reduce heat
+ photonic chips (Lightmatter and Q.ANT)
+ processing-in-memory might reduce energy requirement by 10-50 times
I think the hybrid architecture could provide decisive advantages, especially when designed for AI inference workloads,
> Data center in space could contribute to the EU's Net-Zero goal by 2050
How unbelievably crass. "Let's build something out of immense quantities of environmentally-destructive-to-extract materials and shoot it into space on top of gargantuan amounts of heat and greenhouse gas emissions; since it won't use much earth-sourced energy once it's up there, that nets out to a win!"
Where do they get the hydrogen without putting a load of CO2 into the atmosphere just to manufacture the hydrogen to begin with?
One thing to think about is debt which is not in terms of money.
People are becoming more familiar with "technical debt" since otherwise it comes due by surprise.
With hamsterwheels in space you've got energy debt.
Separate from all other forms of debt that are involved.
Like financial debt, which is only a problem if you can't really afford to do the project so you have to beg, borrow, and/or steal to get it going.
On that point I think I'd be a little skeptical if the richest known person can't actually afford this easily. Especially if he really wants it with all his heart, and has put in any worthwhile effort so far.
Anyway, solar cells are kind of weak when you think about it, they don't produce the high output of a suitable chemical reaction, like the kind that launches the rockets themselves. Which releases so much energy so fast that it's always going to take a serious amount of time for the "little" solar cells to have finally produced an equal amount of energy before a net positive can begin to accrue.
Keeping the assets safely on the home planet simply provides a jump-start that can not be matched.
Flora and fauna near launch sites is not a battle you are going to win. The next space race will need more launch sites, not fewer and we're gonna have to accept a negative impact around those sites.
> A hybrid architecture could dramatically change the heat budget: + optical connections reduce heat + photonic chips (Lightmatter and Q.ANT) + processing-in-memory might reduce energy requirement by 10-50 times
It would also make ground-based computation more efficient by the same amount. That does nothing to make space datacenters make sense.
They do have to be if they want to be approved by the FCC.
And btw Kessler syndrome applies to any orbital band. You've got the logic backwards. Kessler syndrome is usually only considered a threat for LEO because that's where most of the satellites are. But if you're throwing million(s) of satellites into orbit, it becomes an issue at whatever orbital height you pick.
Is sort of somewhat handling 10,000 by enabling them to make orbital adjustments more quickly. By the time you have a million, you will run out of prop way too quick.
Sorry I misspoke, you're totally correct. What I meant to say was it's only a problem if they're orbiting around the Earth. I've heard sun orbits mentioned as a possibility for data centers.
It would still be a space junk problem. Space is big, but amazingly not that big. If you start ejecting little hot BBs at interplanetary speeds, you are creating broad swath of buckshot that will eventually impact something with the force of a missile. Put millions of these satellites into solar orbits (I’m ignoring the huge increase in launch cost this would require, and all the other issues like latency and comms), and you could very well make trips to other planets impossible.
It wouldn’t be Kessler syndrome as you would not have a chain reaction of collisions, but the end result would be the same.
Yeah if you leave enough junk in any orbit it'll become a problem, but I don't think that's necessarily an argument not to put things in that orbit. You'd just need to not hit that critical limit where things become untenable.
This is only relevant to the compute productivity (how much useful work it can produce), but it's irrelevant to the heat dissipation problem. The energy income is fundamentally limited by the solar facing area (x 1361 W/m^2). So the energy output cannot exceed it, regardless useful signals or just waste heat. Even if we just put a stone there, the equilibrium temperature wouldn't be any better or worse.
The danger is that these nuanced, legitimate use cases get rhetorically stretched to support much bigger claims. That's where skepticism kicks back in.
It’s so hard to understand that the foreign staff are now afraid for their safety and their lives?
After the killing of Pretti (execution is probably the more correct word), I guess even some US staff can not be so sure about what would happen to them.
__“But are there not many fascists in your country?"
"There are many who do not know – but will find it out when the time comes.”__
What a wonderful teacher! I wish all teachers were like him.
Regarding the collaboration before the exam, it's really strange. In our generation, asking or exchanging questions was perfectly normal. I got an almost perfect score in physics thanks to that. I guess the elegant solution was still in me, but I might not have been able to come up with it in such a stressful situation. 'Almost' because the professor deducted one point from my score for being absent too often :)
However, oral exams in Europe are quite different from those at US universities. In an oral exam, the professor can interact with the student to see if they truly understand the subject, regardless of the written text. Allowing a chatbot during a written exam today would be defying the very purpose of the exam.
My experience is exactly the opposite. With AI, it's better to start small and simplify as much as possible. Once you have working code, refactor and abstract it as you deem fit, documenting along the way. Not the other way around. In a world abound of imitations and perfect illusions, code is the crucial reality to which you need to anchor yourself, not documents.
But that’s just me, and I'm not trying to convince anyone.
in my experience you do both. small ai spike demos to prove a specific feature or logic, then top-down assemble them into a superstructure. The difference is that I do the spikes on pure vibe, while reserving my design planning for the big system.
Artificial Analysis put MiniMax 2.1 Coding index on 33, far behind frontier models and I feel it's about right. [1]
[1] https://artificialanalysis.ai/models/minimax-m2-1
reply