Well, that would also require all the services to support webauthn/FIDO, which a lot of them don't. Some that do support it only allow a single key, or allow a trivial bypass via "security questions".
In July, I predicted future AI models would someday learn to cheat on SWE-bench by accessing future git history. Turns out, they were already doing it!
There's a big blue button at the top left that says "Get API key", and after you run a prompt in the UI there's another button that says "Get code". Between those you should be good to go.
Each evaluation is linked to detailed information we collect about the model, including its release date, the organization behind it, and in some cases our estimate of the amount of compute used to train the model.
At the moment, the database features results from two benchmarks:
- GPQA Diamond: This is a higher-quality, challenging subset of the GPQA benchmark, which tests models’ ability to answer PhD-level multiple choice questions about chemistry, physics, and biology.
- MATH Level 5: This is a subset of the hardest questions from the MATH benchmark, a dataset of high-school level competition math problems.
We plan to rapidly expand this suite to build a thorough picture of AI progress, adding benchmarks such as FrontierMath, SWE-Bench-Verified, and SciCodeBench.
Hi! I'm Tom, a machine learning engineer at the nonprofit research institute Epoch AI [0]. I've been working on building infrastructure to:
* run LLM evaluations systematically and at scale
* share the data with the public in a rigorous and transparent way
We use the UK government's Inspect [1] library to run the evaluations.
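To give a sense of what that looks like, here's a minimal, self-contained Inspect sketch (recent `inspect_ai` versions; the toy dataset, scorer choice, and model string are placeholders, not our actual task definitions):

```python
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.solver import generate
from inspect_ai.scorer import match

@task
def toy_math():
    # Two hand-written samples standing in for a real benchmark dataset.
    return Task(
        dataset=[
            Sample(input="What is 7 * 8?", target="56"),
            Sample(input="Compute 15 + 27.", target="42"),
        ],
        solver=generate(),
        scorer=match(),
    )

# epochs=8 asks Inspect to run every sample 8 times, which is how you get
# repeated samples per question (see the numbers in the next paragraph).
eval(toy_math(), model="mistral/mistral-small-latest", epochs=8)
```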
As soon as I saw this news on HN, I evaluated Mistral Small 3 on MATH [2] level 5 (hardest subset, 1,324 questions). I get an accuracy of 0.45 (± 0.011). We sample the LLM 8 times for each question, which lets us obtain less noisy estimates of mean accuracy, and measure the consistency of the LLM's answers. The 1,324*8=10,584 samples represent 8.5M tokens (2M in, 6.5M out).
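To give a sense of how an error bar like that can be computed from repeated samples, here's a rough sketch using simulated correctness data in place of the real results (so the printed numbers won't match exactly, and the real calculation may differ in detail):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated per-sample correctness for 1,324 questions x 8 repeats,
# standing in for real eval results with ~0.45 accuracy.
n_questions, n_repeats = 1324, 8
correct = rng.random((n_questions, n_repeats)) < 0.45

# Average over the 8 repeats first, so each question contributes a single
# accuracy estimate; then report the mean and its standard error across
# questions. With real data the repeats for a question are correlated,
# which is why aggregating per question first matters.
per_question = correct.mean(axis=1)
accuracy = per_question.mean()
std_error = per_question.std(ddof=1) / np.sqrt(n_questions)
print(f"accuracy = {accuracy:.3f} +/- {std_error:.3f}")
```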
Note that MATH is a different benchmark from the MathInstruct [3] mentioned in the OP.
It's still early days for Epoch AI's benchmarking work. I'm developing a systematic database of evaluations run directly by us (so we can share the full details transparently), which we hope to release very soon.
One question I have regarding evals: what sampling temperature and/or method do you use? As far as I understand, temperature and sampling method can impact model output a lot. I'd love to hear your thoughts on how these different settings of the same model can impact output, and how to go about evaluating models when it's not clear how to use them to their fullest.
For models we run ourselves from the weights, at the moment we'd use vLLM's defaults, but this may warrant more thought and adjustment. Other things being equal, I prefer to use an AI lab's API, with settings as vanilla as possible, so that we essentially defer to them on these judgments. For example, this is why we ran this Mistral model from Mistral's API instead of from the weights.
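Concretely, running from the weights while touching nothing looks roughly like this (the model name is just an example, and vLLM's default sampling values can vary across versions):

```python
from vllm import LLM, SamplingParams

# Load the open weights locally; any Hugging Face checkpoint works the same way.
llm = LLM(model="mistralai/Mistral-Small-24B-Instruct-2501")

# SamplingParams() with no arguments takes vLLM's library defaults
# (temperature, top_p, etc.) rather than picking values ourselves.
outputs = llm.generate(["What is 7 * 8?"], SamplingParams())
print(outputs[0].outputs[0].text)
```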
I believe the `temperature` parameter, for example, has different implementations across architectures/models, so it's not as simple as picking a single temperature number for all models.
However, I'm curious if you have further thoughts on how we should approach this.
By the way, in the log viewer UI, for any model call, you can click on the "API" button to see the payloads that were sent. In this case, you can see that we do not send any values to Mistral for `top_p`, `temperature`, etc.
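In code form, the request ends up being roughly this minimal payload (model alias and prompt are illustrative; in our actual runs Inspect makes the call), with no sampling parameters included so Mistral's own defaults apply:

```python
import os
import requests

# Minimal chat completion request: only the model and messages are sent,
# so `temperature`, `top_p`, etc. fall back to the API's defaults.
resp = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
    json={
        "model": "mistral-small-latest",
        "messages": [{"role": "user", "content": "What is 7 * 8?"}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```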