Hacker News | jsnell's comments

The abstract of the article is very short, and seems to answer both of your questions pretty clearly.

This is what is special about them:

> a set of ten math questions which have arisen naturally in the research process of the authors. The questions had not been shared publicly until now;

I.e. these are problems of some practical interest, not just performative/competitive maths.

And this is what is known about the solutions:

> the answers are known to the authors of the questions but will remain encrypted for a short time.

I.e. a solution is known, but is guaranteed to not be in the training set for any AI.


> I.e. a solution is known, but is guaranteed to not be in the training set for any AI.

Not a mathematician and obviously you guys understand this better than I do. One thing I can't understand is how they're going to judge if a solution was AI written or human written. I mean, a human could also potentially solve the problem and pass it off as AI? You might say why would a human want to do that? Normal mathematicians might not want to do that. But mathematicians hired by Anthropic or OpenAI might want to do that to pass it off as AI achievements?


Well, I think the paper answers that too. These problems are intended as a tool for honest researchers to use for exploring the capabilities of current AI models, in a reasonably fair way. They're specifically not intended as a rigorous benchmark to be treated adversarially.

Of course a math expert could solve the problems themselves and lie by saying that an AI model did it. In the same way, somebody with enough money could secretly film a movie and then claim that it was made by AI. That's outside the scope of what this paper is trying to address.

The point is not to score models based on how many of the problems they can solve. The point is to look at the models' responses and see how good they are at tackling the problem. And that's why the authors say that ideally, people solving these problems with AI would post complete chat transcripts (or the equivalent) so that readers can assess how much of the intellectual contribution actually came from AI.


> these are problems of some practical interest, not just performative/competitive maths.

FrontierMath did this a year ago. Where is the novelty here?

> a solution is known, but is guaranteed to not be in the training set for any AI.

Wrong, as the questions were posed to commercial AI models and they can solve them.

This paper violates basic benchmarking principles.


> Wrong, as the questions were posed to commercial AI models and they can solve them.

Why does this matter? As far as I can tell, because the solution is not known this only affects the time constant (i.e. the problems were known for longer than a week). It doesn't seem that I should care about that.


Because the companies have the data and can solve the problems themselves. Once a question has been given to a company with the necessary manpower, one can no longer guarantee that the solution is unknown and absent from the training set.

What the OP was pointing out is two typical tells for lazy ChatGPT-generated text, right in the intro. (The m-dash, "it's not just X, it's Y").

Of course that kind of heuristic can have false positives, and not every accusation of AI-written content on HN is correct. But given how much stuff Gregg has written over the years, it's easy to spot-check a few previous posts. This clearly isn't his normal style of writing.

Once we know this blog was generated by a chatbot, why would the reader care about any of it? Was there a Mia, or did the prompt ask for a humanizing anecdote? Basically, show us the prompt rather than the slop.



I'm not sure the volume here is particularly different to past examples. I think the main difference is that there was no custom harness, tooling or fine-tuning. It's just the out of the box capabilities for a generally available model and a generic agent.

But it's not failing 50% of the time. Their status page[0] shows about 99.6% availability for both the API and Claude Code. And specifically for the vulnerability finding use case that the article was about and you're dismissing as "not worth much", why in the world would you need continuous checks to produce value?

[0] https://status.claude.com/
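To put the two figures side by side, here is a minimal sketch that converts an availability percentage into hours of downtime. The 99.6% and 50% values come from this thread; the 30-day window and the helper name `downtime_hours` are assumptions made for illustration, not anything the status page reports in this form.

```python
def downtime_hours(availability_pct: float, window_days: float) -> float:
    """Hours of downtime implied by an availability percentage over a window."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    unavailable_fraction = (100.0 - availability_pct) / 100.0
    return unavailable_fraction * window_days * 24.0


# Assumed 30-day window; the two percentages are the ones discussed above.
for pct in (99.6, 50.0):
    print(f"{pct}% available over 30 days: {downtime_hours(pct, 30):.2f} h down")
```

Under those assumptions, 99.6% works out to a bit under three hours of downtime in a month, against 360 hours for a service that really failed half the time.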


Did you actually look at these?

> https://github.com/jyn514/saltwater

This is just a frontend. It uses Cranelift as the backend. It's missing some fairly basic language features like bitfields and variadic functions. And if I'm reading the documentation right, it requires all the source code to be in a single file...

> https://github.com/ClementTsang/rustcc

This will compile basically no real-world code. The only supported data type is "int".

> https://github.com/maekawatoshiki/rucc

This is just a frontend. It uses LLVM as the backend.


"Couldn't stick to the ABI ... despite CPU manuals being available" is a bizarre interpretation. What the article describes is the generated code being too large. That's an optimization problem, not a "couldn't follow the documentation" problem.

And it's a bit of a nasty optimization problem, because the result is all or nothing. Implementing enough optimizations to get from 60kB to 33kB is useless; all the rewards come from getting to 32kB.
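The all-or-nothing shape can be sketched as a step function. This is only an illustration: the sizes are the ones mentioned above, while reading "kB" as 1024 bytes, treating the limit as inclusive, and the helper name `fits` are all assumptions.

```python
LIMIT_BYTES = 32 * 1024  # assumption: "32kB" read as 32 * 1024 bytes


def fits(code_size_bytes: int, limit_bytes: int = LIMIT_BYTES) -> bool:
    """Pass/fail check: any size over the limit is equally useless."""
    return code_size_bytes <= limit_bytes


# 60kB and 33kB both fail; only reaching 32kB changes the outcome.
for size_kb in (60, 33, 32):
    verdict = "fits" if fits(size_kb * 1024) else "too large"
    print(f"{size_kb}kB: {verdict}")
```

The check returns the same answer for 60kB and 33kB, so partial progress gives no signal until the last kilobyte is gone.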


No? That was a frontend for a toy language using LLVM as the backend. This is a totally self-contained compiler that's capable of compiling the Linux kernel. What's the part that you think is similar?

That does not sound like a credible estimate, and your link does not make any such claim.

I don't think the blog post itself is using that emoji font. The screenshot on the Noto Emoji GitHub page[0] doesn't look like it's using any gradients for the heart emoji, just flat shading. But it is using gradients for some of the other emojis (e.g. the croissant), and obviously the SVG fallback is all or nothing, not per-glyph.

[0] https://github.com/googlefonts/noto-emoji


You need to look closer; the heart emoji has a flat fill, but a gradient in its outline stroke, from lighter-than-red near the top, to darker-than-red on the bottom.
