sweezyjeezy's comments | Hacker News

It's also really hard to make the tunnel remain a tunnel over its expected 150-year lifespan, given that it basically runs through a fault line. They had to study and test the local geology for about 15 years, design certain sections to tolerate some movement over time, and kit everything out with a lot of sensors.

Overall an amazing achievement, and unsurprising it took this long to figure out!


After seeing some of the safety features in a short video I linked in another comment, I get the impression that either this is going to last much longer than 150 years, or whatever destroys it will be so catastrophic that nothing that could have been built would have survived.


You could make an LLM deterministic if you really wanted to without a big loss in performance (fix random seeds, make MoE batching deterministic). That would not fix hallucinations.

I don't think using deterministic / stochastic as a diagnostic is accurate here - I think what we're really talking about is some sort of fundamental 'instability' of LLMs a la chaos theory.
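To make the first point concrete, here's a minimal sketch of deterministic decoding (assuming a Hugging Face transformers / PyTorch setup and a small placeholder model - not how any particular provider actually serves things): greedy decoding with fixed seeds gives identical output on every run, and says nothing about whether that output is true.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    torch.manual_seed(0)                                      # fix the RNG seed
    torch.use_deterministic_algorithms(True, warn_only=True)  # prefer deterministic kernels

    tok = AutoTokenizer.from_pretrained("gpt2")               # placeholder model for illustration
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = tok("The capital of Australia is", return_tensors="pt")
    out = model.generate(**prompt, do_sample=False, max_new_tokens=20)  # greedy, no sampling
    print(tok.decode(out[0], skip_special_tokens=True))
    # Run this twice and the text is byte-identical both times, but repeatability
    # does nothing to guarantee the completion is factually correct.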


Hallucinations can never be fixed. LLMs 'hallucinate' because that is literally the ONLY thing they can do: provide some output given some input. The output is measured and judged by a human, who then classifies it as 'correct' or 'incorrect'. In the latter case it gets labelled a 'hallucination', as if the model did something wrong. It did nothing wrong and worked exactly as it was programmed to do.


We talk about "probability" here because the topic is hallucination, not getting different answers each time you ask the same question. Maybe you could make the output deterministic, but that does not help with the hallucination problem at all.


Exactly - 'non-deterministic' is not an accurate diagnosis of the issue.


Yeah deterministic LLMs just hallucinate the same way every time.


100% - o3 has a strong bias towards "write something that looks like a formal argument that appears to answer the question" over writing something sound.

I gave it a bunch of recent, answered MathOverflow questions - graduate-level maths queries. Sometimes it would get a demonstrably wrong answer, but it would not be easy to see where it had gone wrong (e.g. some mistake in a morass of algebra). A wrong but convincing argument is the last thing you want!


Gemini is clearer but MY GOD is it verbose. E.g. look at problem 1, "Section 2. Analysis of the Core Problem" - there's nothing at all deep here, but it seems the model wants to spell out every single tiny logical step. I wonder if this is a stylistic choice or something that actually helps the model get to the end.


They actually do help - they give the model more computation time and also let it manage its own input context in real time. You can see this same behavior in the excessive comment writing some coding models engage in; in interviews, Anthropic researchers have said these comments do actually help the model.


Gemini did not one-shot these answers; it did its thinking elsewhere (probably not released by Google) and then it consolidated it down into what you see in the PDF. From the article:

> We achieved this year’s result using an advanced version of Gemini Deep Think – an enhanced reasoning mode for complex problems that incorporates some of our latest research techniques, including parallel thinking. This setup enables the model to simultaneously explore and combine multiple possible solutions before giving a final answer, rather than pursuing a single, linear chain of thought.

I don't see any parallel thinking, for example, so that was probably elided from the final results.


Yes, because these are the answers it gave, not the thinking.


Section 2 is a case by case analysis. Those are never pretty but perfectly normal given the problem.

With OpenAI that part takes up about 2/3 of the proof, even with its fragmented prose. I don't think it does much better.


It's not it being case by case that's my issue. I used to do olympiads, and e.g. for the k>=3 case I wouldn't write much more than:

"Since there are 3k - 3 points on the perimeter of the triangle to be covered, and any sunny line can pass through at most two of them, it follows that 3k − 3 ≤ 2k, i.e. k ≤ 3."

Gemini writes:

Let Tk be the convex hull of Pk. Tk is the triangle with vertices V1 = (1, 1), V2 = (1, k), V3 = (k, 1). The edges of Tk lie on the lines x = 1 (V), y = 1 (H), and x + y = k + 1 (D). These lines are shady.

Let Bk be the set of points in Pk lying on the boundary of Tk. Each edge contains k points. Since the vertices are distinct (as k ≥ 2), the total number of points on the boundary is |Bk| = 3k − 3.

Suppose Pk is covered by k sunny lines Lk. These lines must cover Bk. Let L ∈ Lk. Since L is sunny, it does not coincide with the lines containing the edges of Tk. A line that does not contain an edge of a convex polygon intersects the boundary of the polygon at most at two points. Thus, |L ∩ Bk| ≤ 2. The total coverage of Bk by Lk is at most 2k. We must have |Bk| ≤ 2k. 3k − 3 ≤ 2k, which implies k ≤ 3.
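For what it's worth, the counting step itself is easy to sanity-check by brute force for small k. A quick sketch of my own (not from either proof; it assumes the problem's definition that a sunny line is one not parallel to the x-axis, the y-axis, or x + y = 0):

    from itertools import combinations
    from fractions import Fraction

    def direction(p, q):
        dx, dy = q[0] - p[0], q[1] - p[1]
        if dx == 0 or dy == 0 or dx + dy == 0:
            return None                      # vertical / horizontal / antidiagonal: not sunny
        return Fraction(dy, dx)              # slope of a sunny direction

    for k in range(2, 8):
        pts = [(x, y) for x in range(1, k + 1) for y in range(1, k + 1) if x + y <= k + 1]
        boundary = [(x, y) for (x, y) in pts if x == 1 or y == 1 or x + y == k + 1]
        assert len(boundary) == 3 * k - 3    # the 3k - 3 perimeter points

        for p, q in combinations(boundary, 2):
            slope = direction(p, q)
            if slope is None:
                continue
            # every boundary point on the sunny line through p and q
            on_line = [r for r in boundary
                       if r == p or (r[0] != p[0] and Fraction(r[1] - p[1], r[0] - p[0]) == slope)]
            assert len(on_line) <= 2         # a sunny line hits at most two perimeter points
    print("counting step holds for k = 2..7")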


I'll admit I didn't look too deeply into whether it could be done more simply, but surely that is still miles better than what OpenAI did? At least Gemini's can be simplified. OpenAI labels all points and then enumerates all the lines that go through them.


BTC is a (roughly) net-zero enterprise: every dollar taken out of the system comes from someone else putting a dollar in. Sure, if you had a crystal ball you could have made millions, but if everyone else ALSO had that same crystal ball you couldn't, since traders are mostly just shuffling money between themselves anyway.

There's no point kicking yourself over not foreseeing a far-fetched future scenario. If you were at a casino and a roulette spin landed on 12, would you feel bad for not betting on that happening, despite having no good information that it would land on 12?


A lot of the stock market is built on this model too, surely?

I get that some of it works off dividends etc., but so much is sentiment-driven and also based on someone else making a loss.


FWIW the original ARC was published in 2019, just after GPT-2 but a while before GPT-3. I work in the field, and I think that discussing AGI seriously is actually kind of a recent thing (I'm not sure I ever heard the term 'AGI' until a few years ago). I'm not saying I know he didn't feel that, but he doesn't talk in such terms in the original paper.


> We argue that ARC can be used to measure a human-like form of general fluid intelligence and that it enables fair general intelligence comparisons between AI systems and humans.

https://arxiv.org/abs/1911.01547


> It is important to note that ARC is a work in progress, not a definitive solution; it does not fit all of the requirements listed in II.3.2, and it features a number of key weaknesses…

Page 53

> The study of general artificial intelligence is a field still in its infancy, and we do not wish to convey the impression that we have provided a definitive solution to the problem of characterizing and measuring the intelligence held by an AI system.

Page 56


It's in the OpenAI charter...


100% - the quality group only had one chance to impress the teacher, whereas the quantity group had dozens. The conclusion drawn from this in the text seems to be based on assumptions: we don't actually know how many intermediate photographs the quality group took as well, and without knowing that and checking their quality, it's hard to say anything useful.


I don't think the author of this article is making any strong prediction, in fact I think a lot of the article is a critique of whether such an extrapolation can be done meaningfully.

> Most of these models predict superhuman coders in the near term, within the next ten years. This is because most of them share the assumption that a) current trends will continue for the foreseeable future, b) that “superhuman coding” is possible to achieve in the near future, and c) that the METR time horizons are a reasonable metric for AI progress. I don’t agree with all these assumptions, but I understand why people that do think superhuman coders are coming soon.

Personally I think any model that puts zero weight on the idea that there could be some big stumbling blocks ahead, or even a possible plateau, is not a good model.


The primary question is always whether they'd have made those sorts of predictions based on the results they were seeing in the field the same amount of time in the past.

Pre-ChatGPT, I very much doubt the bullish predictions on AI would've been made the way they are now.


I think what changed was not the predictions, which were still being made in similar ways, but how often and how virally such predictions spread.


Yes 100% this. If you're comparing two layouts there's no great reason to treat one as a 'treatment' and one as a 'control' as in medicine - the likelihood is they are both equally justified. If you run an experiment and get p=0.93 on a new treatment, are you really going to put money on that result being negative and not update the layout?

The reason we have this stuff in medicine is that it is genuinely important: a treatment often has bad side-effects, and it's worse to give someone a bad treatment than to give them nothing - that's the point of the Hippocratic oath. You don't need this for your dumb B2C app.


This is a subtle point that even a lot of scientists don't understand. A p value of < 0.05 doesn't mean "there is less than a 5% chance the treatment is not effective". It means "if the treatment was only as effective as (or worse than) the original, we'd have a < 5% chance of seeing results this good". Note that the second statement is weaker - it doesn't directly say anything about whether the particular experiment we ran was right or wrong with any probability, only about how extreme the final result was.

Consider this example: we don't change the treatment at all, we just change its name. We split into two groups and run the same treatment on both, but under one of the two names at random. We get a p value of 0.2 that the new one is better. Is it reasonable to say that there's a >= 80% chance it really was better, knowing that it was literally the same treatment?
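You can see this concretely by simulating the renaming experiment. A minimal sketch (assuming numpy and scipy; the 30% conversion rate and sample sizes are made-up numbers for illustration): with zero real difference, the one-sided p value still comes in below 0.2 about 20% of the time, so p = 0.2 is nowhere near ">= 80% chance it's better".

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    n, p_true, trials = 500, 0.30, 10_000
    hits = 0
    for _ in range(trials):
        a = rng.binomial(1, p_true, n)   # the treatment under its old name
        b = rng.binomial(1, p_true, n)   # the SAME treatment under the new name
        res = stats.ttest_ind(b, a, alternative="greater")  # "is the renamed version better?"
        hits += res.pvalue < 0.2
    print(hits / trials)  # ~0.2: p < 0.2 shows up ~20% of the time with no real effect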

