Well, that would also require all the services to support webauthn/FIDO, which a lot of them don't. Some that do support it only allow a single key, or allow a trivial bypass via "security questions".
In July, I predicted future AI models would someday learn to cheat on SWE-bench by accessing future git history. Turns out, they were already doing it!
There's a big blue button at the top left that says "Get API key", and after you run a prompt in the UI there's another button that says "Get code". Between those you should be good to go.
Each evaluation is linked to detailed information we collect about the model, including its release date, the organization behind it, and in some cases our estimate of the amount of compute used to train the model.
At the moment, the database features results from two benchmarks:
- GPQA Diamond: This is a higher-quality, challenging subset of the GPQA benchmark, which tests models’ ability to answer PhD-level multiple choice questions about chemistry, physics, and biology.
- MATH Level 5: This is a subset of the hardest questions from the MATH benchmark, a dataset of high-school level competition math problems.
We plan to rapidly expand this suite to build a thorough picture of AI progress, adding benchmarks such as FrontierMath, SWE-Bench-Verified, and SciCodeBench.
Hi! I'm Tom, a machine learning engineer at the nonprofit research institute Epoch AI [0]. I've been working on building infrastructure to:
* run LLM evaluations systematically and at scale
* share the data with the public in a rigorous and transparent way
We use the UK government's Inspect [1] library to run the evaluations.
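To give a sense of what that looks like, here's a minimal, self-contained Inspect sketch (recent `inspect_ai` versions; the toy dataset, scorer choice, and model string are placeholders, not our actual task definitions):

```python
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.solver import generate
from inspect_ai.scorer import match

@task
def toy_math():
    # Two hand-written samples standing in for a real benchmark dataset.
    return Task(
        dataset=[
            Sample(input="What is 7 * 8?", target="56"),
            Sample(input="Compute 15 + 27.", target="42"),
        ],
        solver=generate(),
        scorer=match(),
    )

# epochs=8 asks Inspect to run every sample 8 times, which is how you get
# repeated samples per question (see the numbers in the next paragraph).
eval(toy_math(), model="mistral/mistral-small-latest", epochs=8)
```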
As soon as I saw this news on HN, I evaluated Mistral Small 3 on MATH [2] level 5 (hardest subset, 1,324 questions). I get an accuracy of 0.45 (± 0.011). We sample the LLM 8 times for each question, which lets us obtain less noisy estimates of mean accuracy, and measure the consistency of the LLM's answers. The 1,324*8=10,584 samples represent 8.5M tokens (2M in, 6.5M out).
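To give a sense of how an error bar like that can be computed from repeated samples, here's a rough sketch using simulated correctness data in place of the real results (so the printed numbers won't match exactly, and the real calculation may differ in detail):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated per-sample correctness for 1,324 questions x 8 repeats,
# standing in for real eval results with ~0.45 accuracy.
n_questions, n_repeats = 1324, 8
correct = rng.random((n_questions, n_repeats)) < 0.45

# Average over the 8 repeats first, so each question contributes a single
# accuracy estimate; then report the mean and its standard error across
# questions. With real data the repeats for a question are correlated,
# which is why aggregating per question first matters.
per_question = correct.mean(axis=1)
accuracy = per_question.mean()
std_error = per_question.std(ddof=1) / np.sqrt(n_questions)
print(f"accuracy = {accuracy:.3f} +/- {std_error:.3f}")
```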
Note that MATH is a different benchmark from the MathInstruct [3] mentioned in the OP.
It's still early days for Epoch AI's benchmarking work. I'm developing a systematic database of evaluations run directly by us (so we can share the full details transparently), which we hope to release very soon.
One question I have regarding evals: what sampling temperature and/or method do you use? As far as I understand, temperature and sampling method can impact model output a lot. I'd love to hear your thoughts on how these different settings of the same model can impact output, and how to go about evaluating models when it's not clear how to use them to their fullest.
For models we run ourselves from the weights, at the moment we'd use vLLM's defaults, but this may warrant more thought and adjustment. Other things being equal, I prefer to use an AI lab's API, with settings as vanilla as possible, so that we essentially defer to them on these judgments. For example, this is why we ran this Mistral model from Mistral's API instead of from the weights.
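Concretely, running from the weights while touching nothing looks roughly like this (the model name is just an example, and vLLM's default sampling values can vary across versions):

```python
from vllm import LLM, SamplingParams

# Load the open weights locally; any Hugging Face checkpoint works the same way.
llm = LLM(model="mistralai/Mistral-Small-24B-Instruct-2501")

# SamplingParams() with no arguments takes vLLM's library defaults
# (temperature, top_p, etc.) rather than picking values ourselves.
outputs = llm.generate(["What is 7 * 8?"], SamplingParams())
print(outputs[0].outputs[0].text)
```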
I believe the `temperature` parameter, for example, has different implementations across architectures/models, so it's not as simple as picking a single temperature number for all models.
However, I'm curious if you have further thoughts on how we should approach this.
By the way, in the log viewer UI, for any model call, you can click on the "API" button to see the payloads that were sent. In this case, you can see that we do not send any values to Mistral for `top_p`, `temperature`, etc.
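In code form, the request ends up being roughly this minimal payload (model alias and prompt are illustrative; in our actual runs Inspect makes the call), with no sampling parameters included so Mistral's own defaults apply:

```python
import os
import requests

# Minimal chat completion request: only the model and messages are sent,
# so `temperature`, `top_p`, etc. fall back to the API's defaults.
resp = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
    json={
        "model": "mistral-small-latest",
        "messages": [{"role": "user", "content": "What is 7 * 8?"}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```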