Tiberium's comments | Hacker News

The frontend examples, especially the first one, look uncannily similar to what Gemini 3 Pro usually produces. Make of that what you will :)

EDIT: Also checked the chats they shared, and the thinking process is very similar to the raw (not the summarized) Gemini 3 CoT. All the bold sections, numbered lists. It's a distinctive CoT style that only Gemini 3 had before today :)


I don't mind if they're distilling frontier models to make them cheaper, and open-sourcing the weights!

Same, although Gemini 3 Flash already gives it a run for its money on the cheaper aspect. But part of me really wants open source too, because that way, if I really want to some day, I can have privacy or get my own hardware to run it.

I genuinely hope that Gemini 3 Flash gets open sourced, but I feel like something like that could actually crash the AI bubble. Although there are still some issues vibing with the model itself, I find it very competent and fast overall. At this point there might be some placebo effect too, but in reality the model feels really solid.

Most Western countries wouldn't really have a point or incentive to compete if someone open-sourced the model, because then the competition would rather be on providers and their speeds (like how Groq and Cerebras have insane speed).

I had heard that Google would allow institutions like universities to self-host Gemini models or similar, so there's a chance the AI bubble could actually pop if Gemini or other top-tier models accidentally leaked. But I genuinely doubt that will happen, and there are many other ways the AI bubble could pop.


Models being open weights lets infrastructure providers compete on delivering the model as a service, fastest and cheapest.

At some point, companies should be forced to release the weights after a reasonable time has passed since they first sold the service. Maybe after 3 years or so.

It would be great for competition and security research.


Yeah, I think it sometimes even repeats Gemini's injected platform instructions. It's pretty curious because a) Gemini uses something closer to "chain of draft" and never repeats them in full naturally, only the relevant part, and b) these instructions don't seem to have any effect in GLM: it repeats them in the CoT but never follows them. Which is a real problem with any CoT trained through RL (the meaning diverges from the natural language due to reward hacking). Is it possible they used it in the initial SFT pass to improve CoT readability?

How is the raw Gemini 3 CoT accessed? Isn't it hidden?

There are tricks on the API to get access to the raw Gemini 3 CoT; it's extremely easy compared to getting the CoT of GPT-5 (very, very hard).

What are you referring to? I see the 'reasoning' in OpenRouter for GPT-5.2; I was under the impression that was the CoT.

Yes, that's exactly what I'm referring to. When you're using the direct Gemini API (AI Studio/Vertex), with specific tricks you can get the raw reasoning/CoT output of the model, not the summary.

In Antigravity, Gemini sometimes inserts its CoT directly into code comments lol

From a tweet: https://x.com/i/status/2001821298109120856

> can someone help folks at Mistral find more weak baselines to add here? since they can't stomach comparing with SoTA....

> (in case y'all wanna fix it: Chandra, dots.ocr, olmOCR, MinerU, Monkey OCR, and PaddleOCR are a good start)


I've worked on document extraction a lot, and while the tweet is too flippant for my taste, it's not wrong. Mistral is comparing itself to non-VLM computer vision services. While not necessarily what everyone needs, those are very different beasts compared to VLM-based extraction, because they give you precise bounding boxes, usually at the cost of broader "document understanding".

Their failure modes are also vastly different. VLM-based extraction can misread entire sentences or miss entire paragraphs; Sonnet 3 had that issue. Computer vision models instead make in-word typos.


Why not use both? I just built a pipeline for document data extraction that uses PaddleOCR, then Gemini 3 to check and fix errors. It gets close to 99.9% on extraction from financial statements, finally on par with humans.
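
A minimal sketch of what that kind of two-stage pipeline can look like (PaddleOCR 2.x-style API; the Gemini model id and prompt are assumptions, not the parent's actual code):

    from paddleocr import PaddleOCR          # pip install paddleocr
    from google import genai                 # pip install google-genai

    ocr = PaddleOCR(lang="en")
    client = genai.Client()                  # reads GEMINI_API_KEY from env

    def extract(image_path: str) -> str:
        # Stage 1: classical OCR gives raw text (plus bboxes, unused here).
        result = ocr.ocr(image_path)
        raw_text = "\n".join(line[1][0] for page in result for line in page)

        # Stage 2: an LLM pass to catch and fix OCR mistakes.
        prompt = (
            "Below is OCR output from a financial statement. "
            "Fix obvious OCR errors (misread digits, split words) and "
            "return only the corrected text.\n\n" + raw_text
        )
        resp = client.models.generate_content(
            model="gemini-3-flash-preview",  # assumed model id
            contents=prompt,
        )
        return resp.text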

I did the opposite: Tesseract to get bboxes, words, and chars, and then Mistral on the clips, with some reasonable reflow to preserve geometry. Paddle wasn't working on my local machine (until I found RapidOCR). Surya was also very good, but because you can't really tweak any knobs, when it failed it just kinda failed. But Surya > Rapid w/ Paddle > docTR > Tesseract, while the latter gave me the most granularity when I needed it.

Edit: Gemini 2.0 was good enough for VLM cleanup, and now 2.5 or above with structured output makes reconstruction even easier.
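
The Tesseract side of that approach might look roughly like this (a sketch assuming pytesseract and Pillow; the VLM call on each clip is elided):

    import pytesseract                       # pip install pytesseract pillow
    from PIL import Image

    def word_clips(image_path: str, pad: int = 4):
        img = Image.open(image_path)
        # image_to_data returns per-word text, confidence, and bounding boxes.
        data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
        for i, word in enumerate(data["text"]):
            if not word.strip():
                continue
            l, t = data["left"][i], data["top"][i]
            w, h = data["width"][i], data["height"][i]
            # Crop with a small margin; these clips then go to the VLM,
            # reflowed in reading order to preserve geometry.
            yield word, img.crop((l - pad, t - pad, l + w + pad, t + h + pad))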


This is The Way. Remember, AI doesn't have to replace existing solutions; it can tactically supplement them.

Is DeepSeek's not a VLM?

After clicking your link I browsed Twitter for a minute, and damn, that place has become weird (or maybe it always was?)

As someone who has been on Twitter since 2007, I can say it's radically changed in the last few years, to the point of being unrecognizable.

Also, do you know if their benchmarks are available?

On their website, the benchmarks say "Multilingual (Chinese), Multilingual (East-asian), Multilingual (Eastern europe), Multilingual (English), Multilingual (Western europe), Forms, Handwritten, etc." However, there's no reference to the benchmark data.


I'd want to see a comparison with Qwen 3 VL 235B-A22B, which is IME significantly better than MinerU.

At the OP link, they compare themselves to the capabilities of leaderboard AIs and beat them.

For 2.5 Flash Preview, the price was specifically much cheaper for the no-reasoning mode; in this case the model reasons by default, so I don't think they'll increase the price even further.

Yes, but also most of the increase in 3 Flash is in the input context price, which isn't affected by reasoning.

It is affected if it has to round-trip, e.g. because it's making tool calls.

You can still set the thinking budget to 0 to completely disable reasoning, or set the thinking level to minimal or low.

>You cannot disable thinking for Gemini 3 Pro. Gemini 3 Flash also does not support full thinking-off, but the minimal setting means the model likely will not think (though it still potentially can). If you don't specify a thinking level, Gemini will use the Gemini 3 models' default dynamic thinking level, "high".

https://ai.google.dev/gemini-api/docs/thinking#levels


I was talking about Gemini 3 Flash, and you absolutely can disable reasoning: just try sending a thinking budget of 0. It's strange that they don't want to mention this, but it works.
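
For example, with the google-genai Python SDK (the model id is an assumption; thinking_budget=0 is the undocumented part described above):

    from google import genai
    from google.genai import types

    client = genai.Client()  # reads GEMINI_API_KEY from the environment
    resp = client.models.generate_content(
        model="gemini-3-flash-preview",      # assumed model id
        contents="What is 2 + 2?",
        config=types.GenerateContentConfig(
            # A budget of 0 disables reasoning entirely, even though the
            # docs only mention thinking_level minimal/low for Gemini 3.
            thinking_config=types.ThinkingConfig(thinking_budget=0)
        ),
    )
    print(resp.text)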

Gemini 3 Flash is in the second sentence.

See, this is what happens when you turn off thinking completely.

Yet again, Flash receives a notable price hike: from $0.30/$2.50 (input/output, per million tokens) for 2.5 Flash to $0.50/$3.00 (+66.7% input, +20% output) for 3 Flash. Also, as a reminder, 2.0 Flash used to be $0.10/$0.40.

Yes, but this Flash is a lot more powerful - beating Gemini 3 Pro on some benchmarks (and pretty close on others).

I don't view this as a "new Flash" but as "a much cheaper Gemini 3 Pro/GPT-5.2".


I would be less salty if they gave us 3 Flash Lite at the same price as 2.5 Flash or cheaper, with better capability, but they still focus on the pricier models :(

We'll probably get 3 Flash Lite eventually; it just takes time to distill the models, and you want to start with the one that is likely to bring in more money.

Same! I want to do some data stuff from documents and 2.0 pricing was amazing, but the constant increases go the wrong way for this task :/

Right, it depends on your use case. I was looking forward to the model as an upgrade to 2.5 Flash, but when you're processing hundreds of millions of tokens a day (not hard to do if you're dealing in documents or emails with a few users), the economics fall apart.
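
Back-of-the-envelope, with a hypothetical volume and the input prices quoted elsewhere in the thread:

    # 200M input tokens/day is easy to hit with document/email pipelines.
    tokens_per_day = 200_000_000
    old_price, new_price = 0.30, 0.50             # $/1M input (2.5 vs 3 Flash)
    old_cost = tokens_per_day / 1e6 * old_price   # $60/day
    new_cost = tokens_per_day / 1e6 * new_price   # $100/day
    print(f"${old_cost:.0f}/day -> ${new_cost:.0f}/day, "
          f"~${(new_cost - old_cost) * 30:.0f} more per month")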

At least you can play a lot of the Pico-8 games through the website for free; their player shows virtual controller buttons, although for some games it can be awkward.

The only table where they showed comparisons against Opus 4.5 and Gemini 3:

https://x.com/OpenAI/status/1999182104362668275

https://i.imgur.com/e0iB8KC.png


100% on the AIME (assuming it's not in the training data) is pretty impressive. I got like 4/15 when I was in HS...

The no-tools part is impressive; with tools, every model gets 100%.

If I recall, AIME answers are always three-digit integers (000-999). And most of the problems are of the type where, if you have a candidate number, it's reasonable to validate its correctness. So it's easy to brute-force all the candidate integers with code.

tl;dr: humans would do much better too if they could use programming tools :)
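
Sketched out, the brute force looks like this (is_valid_answer is a hypothetical per-problem checker, not anything from the announcement):

    def is_valid_answer(n: int) -> bool:
        # Hypothetical per-problem check: test whether n satisfies the
        # problem's constraints (divisibility, geometry, counting, etc.).
        ...

    # AIME answers are integers in 000-999, so the search space is tiny.
    candidates = [n for n in range(1000) if is_valid_answer(n)]
    assert len(candidates) == 1, "checker is too loose or too strict"
    print(candidates[0])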


uh no, it's not solved by looping over candidate answers when it uses tools


This looks cherry-picked. For example, Claude Opus had a higher score on SWE-bench Verified, so they conveniently left it out. Also, GDPval is literally a benchmark made by OpenAI.

And who believes that the difference between 91.9% and 92.4% is significant in these benchmarks? Clearly these have margins of error that are swept under the rug.

agreed.

The fact that the post compares their reasoning model against Gemini 3 Pro (the "non-reasoning" model) and not Gemini 3 Pro Deep Think (the reasoning one) is quite nasty. If you compare GPT-5.2 Thinking to Gemini 3 Pro Deep Think, the scores are quite similar (sometimes one is better, sometimes the other).

uh oh, where did SWE-bench go :D

maybe they will release it with gpt-5.2-codex

The fake claim here is compression. The results in the repo are likely real, but they're done by running the full transformer teacher model every time. This doesn't achieve anything novel.

That's not how the method works... The full transformer is only needed once to extract the activation fields. That step can even be done offline. Then the teacher can be discarded entirely. The compression result refers to the size of the learned field representation and the small student head that operates directly on it. Simple. No fake claim there. Inference with the student does not involve the transformer at all.

If you look at the student-only scripts in the repo, those runs never load the teacher. That's the novel part.
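
The workflow being described has roughly this shape (an illustrative PyTorch sketch assuming an HF-style teacher that exposes hidden states; layer choice, pooling, and head size are made up, not the repo's actual code):

    import torch
    from torch import nn

    @torch.no_grad()
    def cache_teacher_activations(teacher, loader, layer: int = -1):
        # One-time, offline pass: run the big teacher over the corpus once,
        # keep a pooled intermediate activation per example, then discard it.
        feats, labels = [], []
        for x, y in loader:
            h = teacher(x, output_hidden_states=True).hidden_states[layer]
            feats.append(h.mean(dim=1))      # pool seq dim -> fixed vector
            labels.append(y)
        return torch.cat(feats), torch.cat(labels)

    class StudentHead(nn.Module):
        # Small head trained directly on the cached activation field.
        def __init__(self, d_in: int, n_out: int):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(d_in, 256), nn.ReLU(), nn.Linear(256, n_out)
            )

        def forward(self, h: torch.Tensor) -> torch.Tensor:
            # Inference never touches the transformer, only this head.
            return self.net(h)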


I agree the claim is (perhaps purposefully) confusing.

What they achieved is to create tiny student models, trained on a specific set of inputs, off the teacher model's outputs.

There is clearly novelty in the method and what it achieves. Whether what it achieves would cover many cases is another question.


Can you please share the relevant code that has the training of such a tiny student model that can operate independently of the big teacher model after training? The repository has no such code.
