


Every time I see a table like this, the numbers go up. Can someone explain what this actually means? Is it just an incremental improvement, where some tests are solved a bit better, or is this a breakthrough and this model can do something that all the others cannot?


This is a list of questions and answers that was created by different people.

The questions AND the answers are public.

If the LLM manages, through reasoning OR memory, to repeat back the answer, then it wins.

The scores represent the percentage of correct answers it recalled.
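
Roughly, the scoring works like this (a minimal Python sketch with made-up data; real benchmarks like HLE and ARC have their own, more involved grading, not just exact match):

    # Hypothetical eval set; real benchmarks grade with more than exact string match.
    eval_set = [
        {"question": "What is 2 + 2?", "answer": "4"},
        {"question": "Capital of France?", "answer": "Paris"},
    ]

    def score(model_answers, eval_set):
        # Count exact matches between model outputs and reference answers.
        correct = sum(
            1 for item, out in zip(eval_set, model_answers)
            if out.strip().lower() == item["answer"].strip().lower()
        )
        return 100.0 * correct / len(eval_set)  # reported as a percentage

    print(score(["4", "Lyon"], eval_set))  # -> 50.0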


That is not entirely true. At least some of these tests (like HLE and ARC) take steps to keep the evaluation set private so that LLMs can’t just memorize the answers.

You could question how well this works, but it’s not like the answers are just hanging out on the public internet.


Excuse my ignorance, how do these companies evaluate their models against the evaluation set without access to it?


Cooperation with the eval admins


I estimate another 7 months before models start getting 115% on Humanity's Last Exam.


If you believe another thread, the benchmarks are comparing Gemini 3 (probably with thinking) to GPT-5.1 without thinking.

The person also claims that with thinking on, the gap narrows considerably.

We'll probably have 3rd party benchmarks in a couple of days.


It's easy to verify that the numbers are for GPT-5.1 Thinking (high).

Just go to the leaderboard website and see for yourself: https://arcprize.org/leaderboard



