More

sinuhe69 · 2026-02-12T18:25:05 1770920705

I hope better and cheaper models will be widely available because competition is good for the business. However, I'm more cautious about benchmark claims. MiniMax 2.1 is decent, but one can really not call it smart. The more critical issue is that MiniMax 2 and 2.1 have the strong tendency to reward hacking, often write nonsensical test report while the tests actually failed. And sometimes it changed the existing code base to make its new code "pass", when it actually should fix its own code instead.

Artificial Analysis put MiniMax 2.1 Coding index on 33, far behind frontier models and I feel it's about right. [1]

[1] https://artificialanalysis.ai/models/minimax-m2-1

osti · 2026-02-12T18:35:22 1770921322

That's what I found with some of these LLM models as well. For example I still like to test those models with algorithm problems, and sometimes when they can't actually solve the problem, they will start to hardcode the test cases into the algorithm itself.. Even DeepSeek was doing this at some point, and some of the most recent ones still do this.

qinsignificance · 2026-02-12T19:03:03 1770922983

I have asked GLM4.7 in opencode to make an application to basically filter a couple of spatial datasets hosted at a url I provided it, and instead of trying to download read the dataset, it just read the url, assumed what the datasets were (and got it wrong) is and it's shape (and got it wrong) and the fields (and got it wrong) and just built an application based on vibes that was completely unfixable.

It wrote an extensive test suite on just fake data and then said the app is perfectly working as all tests passed.

This is a model that was supposed to match sonnet 4.5 in benchmarks. I don't think sonnet would be that dumb.

I use LLMs a lot to code, but these chinese models don't match anthropic and openai in being able to decide stuff for themselves. They work well if you give them explicit instructions that leaves little for it to mess up, but we are slowly approaching where OpenAI and anthropic models will make the right decisions on their own

hsaliak · 2026-02-12T22:43:20 1770936200

this aligns perfecly with my experience, but of course, the discourse on X and other forums are filled with people who are not hands on. Marketing is first out of the gate. These models are not yet good enough to be put through a long coding session. They are getting better though! GLM 4.7 and Kimi 2.5 are alright.

esafak · 2026-02-12T19:29:27 1770924567

It really is infuriatingly dumb; like a junior who does not know English. Indeed, it often transitions into Chinese.

Just now it added some stuff to a file starting at L30 and I said "that one line L30 will do remove the rest", it interpreted 'the rest' as the file, and not what it added.

edoceo · 2026-02-12T18:55:28 1770922528

Sounds exactly what a junior-dev would do without proper guidance. Could better direction in the prompts help? I find I frequently have to tell it where to put what fixes. IME they make a lot of spaghetti (LLMs and juniors)

throawayonthe · 2026-02-12T21:10:11 1770930611

wtf kinda juniors are you interacting with

edoceo · 2026-02-12T21:59:27 1770933567

Lots of self-taught; looking for an entry level.

alsetmusic · 2026-02-13T03:54:21 1770954861

I'm self-taught and I've always understood that adjusting tests to cheat is a fail.

amluto · 2026-02-12T23:09:59 1770937799

> And sometimes it changed the existing code base to make its new code "pass", when it actually should fix its own code instead.

I haven’t tried MiniMax, but GPT-5.2-Codex has this problem. Yesterday I watched it observe a Python type error (variable declared with explicit incorrect type — fix was trivial), and it added a cast. (“cast” is Python speak for “override typing for this expression”.) I told it to fix it for real and not use cast. So it started sprinkling Any around the program (“Any” is awful Python speak for “don’t even try to understand this value and don’t warn either”).

kimixa · 2026-02-13T05:53:53 1770962033

Even Claude opus 4.6 is pretty willing to start tearing apart my tests or special-case test values if it doesn't find a solution quickly (and in c++/rust land a good proportion of its "patience" seems to be taken up just getting things that compile)

amluto · 2026-02-13T21:26:09 1771017969

I’ve found that GPT-5.2 is shockingly good at producing code that compiles, despite also being shockingly good at not even trying to compiling it and instead asking me whether I want it to compile the code.

whattheheckheck · 2026-02-13T17:15:23 1771002923

Or it uses type ignore comments

XCSme · 2026-02-12T19:10:00 1770923400

MiniMax 2.1 didn't really work for my data-parsing tasks, a lot of errors.

Instead, this one works surprisingly well for the cost: https://openrouter.ai/xiaomi/mimo-v2-flash

sinuhe69 · 2026-02-12T18:12:03 1770919923

I'm pretty certain that DeepMind (and all other labs) will try their frontier (and even private) models on First Proof [1].

And I wonder how Gemini Deep Think will fare. My guess is that it will get half the way on some problems. But we will have to take an absence as a failure, because nobody wants to publish a negative result, even though it's so important for scientific research.

[1] https://1stproof.org/

blinding-streak · 2026-02-13T08:05:44 1770969944

As a non-mathematician, reading these problems feels like reading a completely foreign language.

https://arxiv.org/html/2602.05192v1

ky3 · 2026-02-13T17:55:17 1771005317

LLM to the rescue. Feed in a problem and ask it to explain it to a layperson. Also feed in sentences that remain obscure and ask to unpack.

zozbot234 · 2026-02-12T18:27:26 1770920846

The 1st proof original solutions are due to be published in about 24h, AIUI.

energy123 · 2026-02-13T04:37:03 1770957423

Feels like an unforced blunder to make the time window so short after going to so much effort and coming up with something so useful.

sinuhe69 · 2026-02-13T10:06:50 1770977210

5 days for Ai is by no mean short! If it can solve it, it would need perhaps 1-2 hours. If it can not, 5 days continuous running would produce gibberish only. We can safely assume that such private models will run inferences entirely on dedicated hardware, sharing with nobody. So if they could not solve the problems, it's not due to any artificial constraint or lack of resources, far from it.

The 5 days window, however, is a sweat spot because it likely prevents cheating by hiring a math PhD and feed the AI with hints and ideas.

energy123 · 2026-02-13T10:27:41 1770978461

5 days is short for memetic propagation on social media to reach everyone who has their own harness and agentic setup that wants to have a go.

zozbot234 · 2026-02-13T13:34:18 1770989658

That's not really how it works, the recent Erdos proofs in Lean were done by a specialized proprietary model (Aristotle by Harmonic) that's specifically trained for this task. Normal agents are not effective.

energy123 · 2026-02-13T14:06:00 1770991560

Why did you omit the other AI-generated Erdos proofs not done by a proprietary model, which occurred on timescales stretched across significantly longer time than 5 days?

zozbot234 · 2026-02-13T14:55:10 1770994510

Those were not really "proofs" by the standard of 1stproof. The only way an AI can possibly convince an unsympathetic peer reviewer that its proof is correct is to write it completely in a formal system like Lean. The so-called "proofs" done with GPT were half baked and required significant human input, hints, fixing after the fact etc. which is enough to disqualify them from this effort.

energy123 · 2026-02-14T09:28:03 1771061283

That wasn't my recollection. The individual who generated one of the proofs did a write-up for his methodology and it didn't involve a human correcting the model.

octoberfranklin · 2026-02-12T22:44:35 1770936275

Really surprised that 1stproof.org was submitted three times and never made front page at HN.

https://hn.algolia.com/?q=1stproof

This is exactly the kind of challenge I would want to judge AI systems based on. It required ten bleeding-edge-research mathematicians to publish a problem they've solved but hold back the answer. I appreciate the huge amount of social capital and coordination that must have taken.

I'm really glad they did it.

lofaszvanitt · 2026-02-13T07:45:04 1770968704

Of course it isn't made the front page. If something is promising they hunt it down, and when conquered they post about it. Lot of times the new category has much better results, than the default HN view.

sinuhe69 · 2026-02-12T00:29:27 1770856167

I’d argue it’s the greatest nightmare and the ultimate contempt for human life and values.

pcj-github · 2026-02-12T00:36:39 1770856599

Compared to the judicial landscape we're facing in the US right now, it sounds like a safeguard.

Until this administration forces OpenAI to comply by secret government LLM training protocols that is...

sinuhe69 · 2026-02-09T16:50:51 1770655851

Can you explain the part why it's not possible to bypass? Could the kids not simply remove the app?

sinuhe69 · 2026-02-04T03:31:53 1770175913

As the focus here is solely on the US, and the comments focus too much on the impossibility of heat dissipation, I want to include some information to broaden the perspective.

- In the EU, the ASCEND study conducted in 2024 by Thales Alenia Space found that data center in space could be possible by 2035. Data center in space could contribute to the EU's Net-Zero goal by 2050 [1]

- heat dissipation could be greatly enhanced with micro droplet technology, and thereby reducing the required radiator surface area by the factor of 5-10

- data center in space could provide advantages for processing space data, instead of sending them all to earth. - the Lonestar project proved that data storage and edge processing in space (moon, cislunar) is possible.

- A hybrid architecture could dramatically change the heat budget: + optical connections reduce heat + photonic chips (Lightmatter and Q.ANT) + processing-in-memory might reduce energy requirement by 10-50 times

I think the hybrid architecture could provide decisive advantages, especially when designed for AI inference workloads,

[1] https://ascend-horizon.eu/

zbentley · 2026-02-04T14:29:19 1770215359

> Data center in space could contribute to the EU's Net-Zero goal by 2050

How unbelievably crass. "Let's build something out of immense quantities of environmentally-destructive-to-extract materials and shoot it into space on top of gargantuan amounts of heat and greenhouse gas emissions; since it won't use much earth-sourced energy once it's up there, that nets out to a win!"

Insane.

indoordin0saur · 2026-02-04T18:30:25 1770229825

Blue Origin at least runs its rockets on hydrogen whose exhaust is only water.

fuzzfactor · 2026-02-04T19:44:13 1770234253

Where do they get the hydrogen without putting a load of CO2 into the atmosphere just to manufacture the hydrogen to begin with?

One thing to think about is debt which is not in terms of money.

People are becoming more familiar with "technical debt" since otherwise it comes due by surprise.

With hamsterwheels in space you've got energy debt.

Separate from all other forms of debt that are involved.

Like financial debt, which is only a problem if you can't really afford to do the project so you have to beg, borrow, and/or steal to get it going.

On that point I think I'd be a little skeptical if the richest known person can't actually afford this easily. Especially if he really wants it with all his heart, and has put in any worthwhile effort so far.

Anyway, solar cells are kind of weak when you think about it, they don't produce the high output of a suitable chemical reaction, like the kind that launches the rockets themselves. Which releases so much energy so fast that it's always going to take a serious amount of time for the "little" solar cells to have finally produced an equal amount of energy before a net positive can begin to accrue.

Keeping the assets safely on the home planet simply provides a jump-start that can not be matched.

All other things being unequal or not.

zbentley · 2026-02-04T22:02:18 1770242538

And heat and pressure. Negligible amounts in terms of the biosphere, but not in terms of flora and fauna in proximity to launch sites.

sgt · 2026-02-05T10:48:27 1770288507

Flora and fauna near launch sites is not a battle you are going to win. The next space race will need more launch sites, not fewer and we're gonna have to accept a negative impact around those sites.

adastra22 · 2026-02-04T04:59:26 1770181166

> micro droplet technology

Intentionally causing Kessler Syndrome?

> A hybrid architecture could dramatically change the heat budget: + optical connections reduce heat + photonic chips (Lightmatter and Q.ANT) + processing-in-memory might reduce energy requirement by 10-50 times

It would also make ground-based computation more efficient by the same amount. That does nothing to make space datacenters make sense.

rockemsockem · 2026-02-04T06:32:09 1770186729

Kessler syndrome is only a problem if the satellites are in LEO. They don't have to be.

adastra22 · 2026-02-04T07:38:13 1770190693

They do have to be if they want to be approved by the FCC.

And btw Kessler syndrome applies to any orbital band. You've got the logic backwards. Kessler syndrome is usually only considered a threat for LEO because that's where most of the satellites are. But if you're throwing million(s) of satellites into orbit, it becomes an issue at whatever orbital height you pick.

jraby3 · 2026-02-04T08:36:45 1770194205

https://starlink.com/updates/stargaze

philipwhiuk · 2026-02-04T13:16:48 1770211008

Is sort of somewhat handling 10,000 by enabling them to make orbital adjustments more quickly. By the time you have a million, you will run out of prop way too quick.

rockemsockem · 2026-02-05T18:25:04 1770315904

Sorry I misspoke, you're totally correct. What I meant to say was it's only a problem if they're orbiting around the Earth. I've heard sun orbits mentioned as a possibility for data centers.

adastra22 · 2026-02-05T19:20:17 1770319217

It would still be a space junk problem. Space is big, but amazingly not that big. If you start ejecting little hot BBs at interplanetary speeds, you are creating broad swath of buckshot that will eventually impact something with the force of a missile. Put millions of these satellites into solar orbits (I’m ignoring the huge increase in launch cost this would require, and all the other issues like latency and comms), and you could very well make trips to other planets impossible.

It wouldn’t be Kessler syndrome as you would not have a chain reaction of collisions, but the end result would be the same.

rockemsockem · 2026-02-07T22:21:59 1770502919

Yeah if you leave enough junk in any orbit it'll become a problem, but I don't think that's necessarily an argument not to put things in that orbit. You'd just need to not hit that critical limit where things become untenable.

typ · 2026-02-04T07:58:44 1770191924

> reduce energy requirement by 10-50 times

This is only relevant to the compute productivity (how much useful work it can produce), but it's irrelevant to the heat dissipation problem. The energy income is fundamentally limited by the solar facing area (x 1361 W/m^2). So the energy output cannot exceed it, regardless useful signals or just waste heat. Even if we just put a stone there, the equilibrium temperature wouldn't be any better or worse.

MarceliusK · 2026-02-05T17:29:03 1770312543

The danger is that these nuanced, legitimate use cases get rhetorically stretched to support much bigger claims. That's where skepticism kicks back in.

sinuhe69 · 2026-01-28T02:07:50 1769566070

It’s so hard to understand that the foreign staff are now afraid for their safety and their lives?

After the killing of Pretti (execution is probably the more correct word), I guess even some US staff can not be so sure about what would happen to them.

__“But are there not many fascists in your country?"

"There are many who do not know – but will find it out when the time comes.”__

sinuhe69 · 2026-01-27T07:05:30 1769497530

My general take on any AI/ML in medicine is that without a proper clinical validation, they are not worth to try. Also, AI Snake Oil is worth reading.

rubatuga · 2026-01-27T08:00:15 1769500815

Clinical validation, proper calibration, ethnic and community and population variants, questioning technique and more ...

joelthelion · 2026-01-27T09:54:24 1769507664

Exactly. There's a lot of potential, but it needs to be done right, otherwise it is worse than useless.

sinuhe69 · 2026-01-20T14:38:56 1768919936

What a wonderful teacher! I wish all teachers were like him.

Regarding the collaboration before the exam, it's really strange. In our generation, asking or exchanging questions was perfectly normal. I got an almost perfect score in physics thanks to that. I guess the elegant solution was still in me, but I might not have been able to come up with it in such a stressful situation. 'Almost' because the professor deducted one point from my score for being absent too often :)

However, oral exams in Europe are quite different from those at US universities. In an oral exam, the professor can interact with the student to see if they truly understand the subject, regardless of the written text. Allowing a chatbot during a written exam today would be defying the very purpose of the exam.

sinuhe69 · 2026-01-20T02:59:06 1768877946

Pyret, a teaching language for CS, in the vein of Racket, does require testing by writing functions.

https://pyret.org/docs/latest/testing.html

sinuhe69 · 2026-01-14T14:33:24 1768401204

My experience is exactly the opposite. With AI, it's better to start small and simplify as much as possible. Once you have working code, refactor and abstract it as you deem fit, documenting along the way. Not the other way around. In a world abound of imitations and perfect illusions, code is the crucial reality to which you need to anchor yourself, not documents.

But that’s just me, and I'm not trying to convince anyone.

FeepingCreature · 2026-01-28T06:49:08 1769582948

in my experience you do both. small ai spike demos to prove a specific feature or logic, then top-down assemble them into a superstructure. The difference is that I do the spikes on pure vibe, while reserving my design planning for the big system.