Tiberium's comments | Hacker News

The frontend examples, especially the first one, look uncannily similar to what Gemini 3 Pro usually produces. Make of that what you will :)

EDIT: Also checked the chats they shared, and the thinking process is very similar to the raw (not the summarized) Gemini 3 CoT. All the bold sections, numbered lists. It's a distinctive CoT style that only Gemini 3 had before today :)


I don't mind if they're distilling frontier models to make them cheaper, and open-sourcing the weights!

Same, although Gemini 3 Flash already gives it a run for its money on the cheaper aspect. But part of me really wants open source too, because that way, if I really want to some day, I can have privacy or get my own hardware to run it.

I genuinely hope that Gemini 3 Flash gets open sourced, but I feel like something like that could actually crash the AI bubble. Although there are still some issues vibing with the model itself, I find it very competent and fast overall. At this point there might be some placebo effect too, but in reality the model feels really solid.

Most Western countries wouldn't really have a point or incentive to compete if someone open-sourced the model, because then the competition would rather be on providers and their speeds (like how Groq and Cerebras have insane speed).

I had heard that Google would allow institutions like universities to self-host Gemini models or similar, so there's a chance the AI bubble could actually pop if Gemini or other top-tier models accidentally leaked. But I genuinely doubt that will happen, and there are many other ways the AI bubble could pop.


Models being open weights lets infrastructure providers compete on delivering the model as a service, fastest and cheapest.

At some point, companies should be forced to release the weights after a reasonable time has passed since they first sold the service. Maybe after 3 years or so.

It would be great for competition and security research.


Yeah, I think it sometimes even repeats Gemini's injected platform instructions. It's pretty curious because a) Gemini uses something closer to "chain of draft" and never repeats them in full naturally, only the relevant part, and b) these instructions don't seem to have any effect in GLM: it repeats them in the CoT but never follows them. Which is a real problem with any CoT trained through RL (the meaning diverges from the natural language due to reward hacking). Is it possible they used it in the initial SFT pass to improve CoT readability?

How is the raw Gemini 3 CoT accessed? Isn't it hidden?

There are tricks on the API to get access to the raw Gemini 3 CoT; it's extremely easy compared to getting the CoT of GPT-5 (very, very hard).

What are you referring to? I see the 'reasoning' in OpenRouter for GPT-5.2; I was under the impression that was the CoT.

Yes, that's exactly what I'm referring to. When you're using the direct Gemini API (AI Studio/Vertex), with specific tricks you can get the raw reasoning/CoT output of the model, not the summary.

In Antigravity, Gemini sometimes inserts its CoT directly into code comments lol

From a tweet: https://x.com/i/status/2001821298109120856

> can someone help folks at Mistral find more weak baselines to add here? since they can't stomach comparing with SoTA....

> (in case y'all wanna fix it: Chandra, dots.ocr, olmOCR, MinerU, Monkey OCR, and PaddleOCR are a good start)


I've worked on document extraction a lot, and while the tweet is too flippant for my taste, it's not wrong. Mistral is comparing itself to non-VLM computer vision services. While not necessarily what everyone needs, those are very different beasts compared to VLM-based extraction, because they give you precise bounding boxes, usually at the cost of broader "document understanding".

Their failure modes are also vastly different. VLM-based extraction can misread entire sentences or miss entire paragraphs; Sonnet 3 had that issue. Computer vision models instead make in-word typos.


Why not use both? I just built a pipeline for document data extraction that uses PaddleOCR, then Gemini 3 to check and fix errors. It gets close to 99.9% on extraction from financial statements, finally on par with humans.
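
A minimal sketch of what that kind of two-stage pipeline can look like (PaddleOCR 2.x-style API; the Gemini model id and prompt are assumptions, not the parent's actual code):

    from paddleocr import PaddleOCR          # pip install paddleocr
    from google import genai                 # pip install google-genai

    ocr = PaddleOCR(lang="en")
    client = genai.Client()                  # reads GEMINI_API_KEY from env

    def extract(image_path: str) -> str:
        # Stage 1: classical OCR gives raw text (plus bboxes, unused here).
        result = ocr.ocr(image_path)
        raw_text = "\n".join(line[1][0] for page in result for line in page)

        # Stage 2: an LLM pass to catch and fix OCR mistakes.
        prompt = (
            "Below is OCR output from a financial statement. "
            "Fix obvious OCR errors (misread digits, split words) and "
            "return only the corrected text.\n\n" + raw_text
        )
        resp = client.models.generate_content(
            model="gemini-3-flash-preview",  # assumed model id
            contents=prompt,
        )
        return resp.text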

I did the opposite: Tesseract to get bboxes, words, and chars, and then Mistral on the clips, with some reasonable reflow to preserve geometry. Paddle wasn't working on my local machine (until I found RapidOCR). Surya was also very good, but because you can't really tweak any knobs, when it failed it just kinda failed. But Surya > Rapid w/ Paddle > docTR > Tesseract, while the latter gave me the most granularity when I needed it.

Edit: Gemini 2.0 was good enough for VLM cleanup, and now 2.5 or above with structured output makes reconstruction even easier.
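
The Tesseract side of that approach might look roughly like this (a sketch assuming pytesseract and Pillow; the VLM call on each clip is elided):

    import pytesseract                       # pip install pytesseract pillow
    from PIL import Image

    def word_clips(image_path: str, pad: int = 4):
        img = Image.open(image_path)
        # image_to_data returns per-word text, confidence, and bounding boxes.
        data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
        for i, word in enumerate(data["text"]):
            if not word.strip():
                continue
            l, t = data["left"][i], data["top"][i]
            w, h = data["width"][i], data["height"][i]
            # Crop with a small margin; these clips then go to the VLM,
            # reflowed in reading order to preserve geometry.
            yield word, img.crop((l - pad, t - pad, l + w + pad, t + h + pad))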


This is The Way. Remember, AI doesn't have to replace existing solutions; it can tactically supplement them.

Is DeepSeek's not a VLM?

After clicking your link I browsed Twitter for a minute, and damn, that place has become weird (or maybe it always was?)

As someone who has been on Twitter since 2007, I can say it's radically changed in the last few years, to the point of being unrecognizable.

Also, do you know if their benchmarks are available?

On their website, the benchmarks say "Multilingual (Chinese), Multilingual (East-asian), Multilingual (Eastern europe), Multilingual (English), Multilingual (Western europe), Forms, Handwritten, etc." However, there's no reference to the benchmark data.


I'd want to see a comparison with Qwen 3 VL 235B-A22B, which is IME significantly better than MinerU.

At the OP link, they compare themselves to the capabilities of leaderboard AIs and beat them.

For 2.5 Flash Preview, the price was specifically much cheaper for the no-reasoning mode; in this case the model reasons by default, so I don't think they'll increase the price even further.

Yes, but also most of the increase in 3 Flash is in the input context price, which isn't affected by reasoning.

It is affected if it has to round-trip, e.g. because it's making tool calls.

You can still set the thinking budget to 0 to completely disable reasoning, or set the thinking level to minimal or low.

>You cannot disable thinking for Gemini 3 Pro. Gemini 3 Flash also does not support full thinking-off, but the minimal setting means the model likely will not think (though it still potentially can). If you don't specify a thinking level, Gemini will use the Gemini 3 models' default dynamic thinking level, "high".

https://ai.google.dev/gemini-api/docs/thinking#levels


I was talking about Gemini 3 Flash, and you absolutely can disable reasoning: just try sending a thinking budget of 0. It's strange that they don't want to mention this, but it works.
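
For example, with the google-genai Python SDK (the model id is an assumption; thinking_budget=0 is the undocumented part described above):

    from google import genai
    from google.genai import types

    client = genai.Client()  # reads GEMINI_API_KEY from the environment
    resp = client.models.generate_content(
        model="gemini-3-flash-preview",      # assumed model id
        contents="What is 2 + 2?",
        config=types.GenerateContentConfig(
            # A budget of 0 disables reasoning entirely, even though the
            # docs only mention thinking_level minimal/low for Gemini 3.
            thinking_config=types.ThinkingConfig(thinking_budget=0)
        ),
    )
    print(resp.text)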

Gemini 3 Flash is in the second sentence.

See, this is what happens when you turn off thinking completely.

Yet again, Flash receives a notable price hike: from $0.30/$2.50 (input/output, per million tokens) for 2.5 Flash to $0.50/$3.00 (+66.7% input, +20% output) for 3 Flash. Also, as a reminder, 2.0 Flash used to be $0.10/$0.40.

Yes, but this Flash is a lot more powerful - beating Gemini 3 Pro on some benchmarks (and pretty close on others).

I don't view this as a "new Flash" but as "a much cheaper Gemini 3 Pro/GPT-5.2".


I would be less salty if they gave us 3 Flash Lite at the same price as 2.5 Flash or cheaper, with better capability, but they still focus on the pricier models :(

We'll probably get 3 Flash Lite eventually; it just takes time to distill the models, and you want to start with the one that is likely to bring in more money.

Same! I want to do some data stuff from documents and 2.0 pricing was amazing, but the constant increases go the wrong way for this task :/

Right, it depends on your use case. I was looking forward to the model as an upgrade to 2.5 Flash, but when you're processing hundreds of millions of tokens a day (not hard to do if you're dealing in documents or emails with a few users), the economics fall apart.
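
Back-of-the-envelope, with a hypothetical volume and the input prices quoted elsewhere in the thread:

    # 200M input tokens/day is easy to hit with document/email pipelines.
    tokens_per_day = 200_000_000
    old_price, new_price = 0.30, 0.50             # $/1M input (2.5 vs 3 Flash)
    old_cost = tokens_per_day / 1e6 * old_price   # $60/day
    new_cost = tokens_per_day / 1e6 * new_price   # $100/day
    print(f"${old_cost:.0f}/day -> ${new_cost:.0f}/day, "
          f"~${(new_cost - old_cost) * 30:.0f} more per month")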

At least you can play a lot of the Pico-8 games through the website for free; their player shows virtual controller buttons, although for some games it can be awkward.

The only table where they showed comparisons against Opus 4.5 and Gemini 3:

https://x.com/OpenAI/status/1999182104362668275

https://i.imgur.com/e0iB8KC.png


100% on the AIME (assuming it's not in the training data) is pretty impressive. I got like 4/15 when I was in HS...

The no-tools part is impressive; with tools, every model gets 100%.

If I recall, AIME answers are always three-digit integers (000-999). And most of the problems are of the type where, if you have a candidate number, it's reasonable to validate its correctness. So it's easy to brute-force all the candidate integers with code.

tl;dr: humans would do much better too if they could use programming tools :)
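
Sketched out, the brute force looks like this (is_valid_answer is a hypothetical per-problem checker, not anything from the announcement):

    def is_valid_answer(n: int) -> bool:
        # Hypothetical per-problem check: test whether n satisfies the
        # problem's constraints (divisibility, geometry, counting, etc.).
        ...

    # AIME answers are integers in 000-999, so the search space is tiny.
    candidates = [n for n in range(1000) if is_valid_answer(n)]
    assert len(candidates) == 1, "checker is too loose or too strict"
    print(candidates[0])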


uh no, it's not solved by looping over candidate answers when it uses tools


This looks cherry-picked. For example, Claude Opus had a higher score on SWE-bench Verified, so they conveniently left it out. Also, GDPval is literally a benchmark made by OpenAI.

And who believes that the difference between 91.9% and 92.4% is significant in these benchmarks? Clearly these have margins of error that are swept under the rug.

agreed.

The fact that the post compares their reasoning model against Gemini 3 Pro (the "non-reasoning" model) and not Gemini 3 Pro Deep Think (the reasoning one) is quite nasty. If you compare GPT-5.2 Thinking to Gemini 3 Pro Deep Think, the scores are quite similar (sometimes one is better, sometimes the other).

uh oh, where did SWE-bench go :D

maybe they will release it with gpt-5.2-codex

The fake claim here is compression. The results in the repo are likely real, but they're done by running the full transformer teacher model every time. This doesn't achieve anything novel.

That's not how the method works... The full transformer is only needed once to extract the activation fields. That step can even be done offline. Then the teacher can be discarded entirely. The compression result refers to the size of the learned field representation and the small student head that operates directly on it. Simple. No fake claim there. Inference with the student does not involve the transformer at all.

If you look at the student-only scripts in the repo, those runs never load the teacher. That's the novel part.
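
The workflow being described has roughly this shape (an illustrative PyTorch sketch assuming an HF-style teacher that exposes hidden states; layer choice, pooling, and head size are made up, not the repo's actual code):

    import torch
    from torch import nn

    @torch.no_grad()
    def cache_teacher_activations(teacher, loader, layer: int = -1):
        # One-time, offline pass: run the big teacher over the corpus once,
        # keep a pooled intermediate activation per example, then discard it.
        feats, labels = [], []
        for x, y in loader:
            h = teacher(x, output_hidden_states=True).hidden_states[layer]
            feats.append(h.mean(dim=1))      # pool seq dim -> fixed vector
            labels.append(y)
        return torch.cat(feats), torch.cat(labels)

    class StudentHead(nn.Module):
        # Small head trained directly on the cached activation field.
        def __init__(self, d_in: int, n_out: int):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(d_in, 256), nn.ReLU(), nn.Linear(256, n_out)
            )

        def forward(self, h: torch.Tensor) -> torch.Tensor:
            # Inference never touches the transformer, only this head.
            return self.net(h)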


I agree the claim is (perhaps purposefully) confusing.

What they achieved is to create tiny student models, trained on a specific set of inputs, off the teacher model's outputs.

There is clearly novelty in the method and what it achieves. Whether what it achieves would cover many cases is another question.


Can you please share the relevant code that has the training of such a tiny student model that can operate independently of the big teacher model after training? The repository has no such code.
