This is very similar to Karpathy's idea of a "cognitive core" [1]: an extremely small model with near-zero encyclopedic knowledge but basic reasoning and tool-use capabilities.
I don't care about the supposed ecological consequences of AI. If we need more water, we build more desalination plants. If we need more electricity, we build more nuclear reactors.
This is purely a technological problem and not a moral one.
There were people in other industries, long before "AI", who said "I don't care about the ecological consequences of my actions." As a society we have turned them into law-abiding citizens. You will get there too. Don't worry, the time will come. You will be regulated, same as cryptocurrencies, chemicals, oil and gas, …
If you were capable of time travel and could go to the past and convince the world's governments of the evils of the oil and gas industries, and that their expansion should be prevented, would you have done it? Would you have prevented the technological and societal advances that came from oil and gas in order to avoid their ecological consequences?
If you answer yes, I don't think we can agree on anything. If you answer no, I think you are a hypocrite.
In which sense is it regulated? Are they regulated in any way that matters for this discussion? Have their ecological consequences been avoided by regulation? The oil and gas industries continue to be the biggest culprits of climate change, and that cannot be changed by law.
If data centers were "regulated" would that make you happy? Even if those data centers continued to use the same amount of electricity and the same amount of water?
Clean water is a public good; it is required for basic human survival and for growing the crops that feed people. Both of these uses depend on fairly cheap water, and in many, many places the supply of sufficiently cheap water is already constrained. This is causing shortages for both basic human needs and agriculture.
Who will pay for the desalination plant construction? Who will pay for the operation?
If the AI companies are ready to pay the full marginal cost of this "new water", and not free-load on the already insufficient supply needed for more important uses, then fine. But I very much doubt that is what will happen.
https://www.thedalles.org/news_detail_T4_R180.php - "The fees paid by Google have funded essential upgrades to our water systems, ensuring reliable service and addressing the City's growing needs. Additionally, Google continues to pay for its water use and contributes to infrastructure projects that exceed the requirements of its facilities."
https://commerce.idaho.gov/press-releases/meta-announces-kun... - "As part of the company’s commitment to Kuna, Meta is investing approximately $50 million in a new water and sewer system for the city. Infrastructure will be constructed by Meta and dedicated to the City of Kuna to own and operate."
For desalination, the important part is paying the ongoing cost. The opex is much higher than for conventional supply, and it's not fair to simply average that cost into everyone's water bills.
Are any data centers using desalinated water? I thought that was a shockingly expensive and hence very rare process.
(I asked ChatGPT and it said that some of the Gulf state data centers do.)
They do use treated (aka drinking) water, but that's a relatively inexpensive process which should be easily covered by the extra cash they shovel into their water systems on an annual basis.
Read the comment I replied to: they proposed that since desalination is possible, there can be no meaningful shortage of water.
And yes, many places have plenty of water. After some capex improvements to the local system, a datacenter is often net-helpful, as it spreads the fixed cost of the water system out over more gallons delivered.
But many places don't have lots of water to spare.
You just have to extrapolate the improvements in consistency in image models over the last couple of years and apply them to these kinds of video models. When, in a couple of years, they can generate videos of many physical phenomena that are nearly indistinguishable from reality, you'll see why they are called "world models".
I remember running this kind of test with a vanilla transformer trained on my laptop on a small text dataset. I basically added N^3 attention, where each layer could pay attention to previous layers. It didn't improve anything and was much slower.
Hard to say whether something that works at a couple dozen million parameters scales to an actual billion-parameter model, but my impression is that the high dimensionality of the residual stream already lets any layer access information from previous layers if the transformer needs it.
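For concreteness, here's a toy PyTorch sketch of the kind of cross-layer attention being described (my reconstruction, not the original experiment's code; causal masking and the training loop are left out):

    import torch
    import torch.nn as nn

    class CrossLayerBlock(nn.Module):
        def __init__(self, d_model=64, n_heads=4):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.layer_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                     nn.Linear(4 * d_model, d_model))
            self.ln1 = nn.LayerNorm(d_model)
            self.ln2 = nn.LayerNorm(d_model)
            self.ln3 = nn.LayerNorm(d_model)

        def forward(self, x, prev_outputs):
            # Ordinary self-attention over the token sequence.
            h = self.ln1(x)
            x = x + self.self_attn(h, h, h, need_weights=False)[0]
            # Extra cross-attention over the concatenated outputs of all earlier
            # layers; this is the part the comment says was much slower and didn't help.
            if prev_outputs:
                mem = torch.cat(prev_outputs, dim=1)   # (batch, n_prev * seq, d_model)
                q = self.ln2(x)
                x = x + self.layer_attn(q, mem, mem, need_weights=False)[0]
            return x + self.mlp(self.ln3(x))

    class CrossLayerTransformer(nn.Module):
        def __init__(self, vocab=256, d_model=64, n_layers=4):
            super().__init__()
            self.embed = nn.Embedding(vocab, d_model)
            self.blocks = nn.ModuleList([CrossLayerBlock(d_model) for _ in range(n_layers)])
            self.head = nn.Linear(d_model, vocab)

        def forward(self, tokens):
            x = self.embed(tokens)
            prev = []
            for block in self.blocks:
                x = block(x, prev)
                prev.append(x)        # later layers can attend to this output
            return self.head(x)

    logits = CrossLayerTransformer()(torch.randint(0, 256, (2, 32)))  # (2, 32, 256)

The extra attention call gives every layer n_prev * seq additional keys to score, which is where the slowdown comes from.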
It seems like this could be solved by partial structured output, where the structure of the JSON itself is constrained but the values of the JSON entries are not (so even if "quantity" here is set to int, the model can output "52.2"; sketch below). Of course, we would need additional parsing, but I think it's a fair compromise.
And about structured outputs messing with chain-of-thought... Is CoT really used with normal models nowadays? I think that if you need CoT you might as well use a reasoning model, and that solves the problem.
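A rough sketch of that "constrain the shape, not the values" idea (the schema and helper names below are illustrative, not any particular provider's structured-output API):

    # Constrain the JSON shape but declare every leaf as a string, then
    # coerce the values ourselves afterwards.
    import json

    RELAXED_SCHEMA = {                      # what you'd hand to the structured-output API
        "type": "object",
        "properties": {
            "item": {"type": "string"},
            "quantity": {"type": "string"}, # not "integer": the model may say "52.2"
        },
        "required": ["item", "quantity"],
    }

    def coerce(value: str):
        """Leniently parse a string leaf into int/float/bool, else keep the string."""
        for caster in (int, float):
            try:
                return caster(value)
            except ValueError:
                pass
        if value.lower() in ("true", "false"):
            return value.lower() == "true"
        return value

    def parse_relaxed(raw: str) -> dict:
        data = json.loads(raw)
        return {k: coerce(v) if isinstance(v, str) else v for k, v in data.items()}

    print(parse_relaxed('{"item": "flour", "quantity": "52.2"}'))
    # {'item': 'flour', 'quantity': 52.2}

The point is that the downstream parser, not the decoder, gets to decide whether "52.2" is acceptable for a field that was nominally an int.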
I'm confused about the "Accuracy vs Cost" section. Why is Gemini 3 Pro so cheap? It's basically the cheapest model in the graph (sans Llama 4 and Mistral Large 3) by a wide margin, even compared to Gemini 3 Flash. Is that an error?
It's not an error; Gemini 3 Pro is just somehow able to complete the benchmark while using way fewer tokens than any other model. Gemini 3 Flash is way cheaper per token, but it also tends to generate a ton of reasoning tokens to get to its answer.
They have a similar chart that compares results across all their benchmarks vs. cost and 3 Flash is about half as expensive as 3 Pro there despite being four times cheaper per token.
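Back-of-the-envelope version of the tradeoff (the prices and token counts below are made up; only the rough 4x per-token ratio comes from the comments above):

    # total cost = price per token * tokens generated
    price_pro, price_flash = 4.0, 1.0   # $ per 1M output tokens (hypothetical)
    tok_pro, tok_flash = 0.5, 4.0       # M tokens needed to finish one benchmark (hypothetical)

    print(price_pro * tok_pro)          # 2.0 -> Pro, pricier per token, is cheaper here
    print(price_flash * tok_flash)      # 4.0 -> Flash's reasoning tokens eat the discount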
Transformers show remarkable versatility across domains, suggesting the existence of inductive biases beneficial across modalities. In this work, we explore a new way to instil such generic biases in vision transformers (ViTs) by pretraining on procedurally-generated data devoid of visual or semantic content. We generate this data with simple algorithms such as formal grammars, so the results bear no relationship to either natural or synthetic images. We use this procedurally-generated data to pretrain ViTs in a warm-up phase that bypasses their visual patch embedding mechanisms, thus encouraging the models to internalise abstract computational priors. When followed by standard image-based training, this warm-up significantly improves data efficiency, convergence speed, and downstream performance. On ImageNet-1k for example, allocating just 1% of the training budget to procedural data improves final accuracy by over 1.7%. In terms of its effect on performance, 1% procedurally generated data is thus equivalent to 28% of the ImageNet-1k data. These findings suggest a promising path toward new data-efficient and domain-agnostic pretraining strategies.
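As I read the abstract, the warm-up looks something like the sketch below (my own minimal PyTorch reconstruction, not the authors' code; the toy grammar, the masked-symbol objective, and the omitted positional embeddings are all assumptions on my part):

    import torch
    import torch.nn as nn

    D, SEQ, VOCAB = 128, 64, 16        # tiny sizes, just to make the sketch runnable

    def grammar_sequence(length=SEQ):
        # Toy rewriting grammar standing in for the paper's procedural generators
        # (the actual grammars/algorithms are not specified in the abstract).
        rules = {0: [0, 1], 1: [2, 0], 2: [1]}
        s = [0]
        while len(s) < length:
            s = [t for sym in s for t in rules[sym]]
        return torch.tensor(s[:length])

    class TwoPhaseViT(nn.Module):
        def __init__(self):
            super().__init__()
            self.symbol_embed = nn.Embedding(VOCAB, D)                      # warm-up input path
            self.patch_embed = nn.Conv2d(3, D, kernel_size=16, stride=16)   # image input path
            layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
            self.trunk = nn.TransformerEncoder(layer, num_layers=4)         # shared trunk
            self.symbol_head = nn.Linear(D, VOCAB)
            self.image_head = nn.Linear(D, 1000)

        def forward_symbols(self, tokens):
            # Phase 1: feed grammar symbols straight into the trunk, bypassing patches.
            return self.symbol_head(self.trunk(self.symbol_embed(tokens)))

        def forward_images(self, images):
            # Phase 2: standard image path (positional embeddings / CLS token omitted here).
            x = self.patch_embed(images).flatten(2).transpose(1, 2)
            return self.image_head(self.trunk(x).mean(dim=1))

    model = TwoPhaseViT()
    batch = torch.stack([grammar_sequence() for _ in range(8)])
    masked, hide = batch.clone(), torch.rand(batch.shape) < 0.3
    masked[hide] = VOCAB - 1                                   # last id used as a [MASK] symbol
    loss = nn.functional.cross_entropy(model.forward_symbols(masked)[hide], batch[hide])
    loss.backward()                                            # warm-up step on procedural data only

The key detail from the abstract is that phase 1 never touches the patch embedding: the trunk sees procedurally generated symbols directly, and only phase 2 trains on images.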
Kind of. You could theoretically use LoRA for this, in fact, but it probably wouldn't have enough capacity to be a proper substitute for the attention mechanism. Instead, a full MLP is trained as input chunks get processed.
This is an oversimplification of what Titans does. The model performs nested learning: it learns during inference, and during training the model weights learn _how and what_ to learn during inference. If the input contains chunks of irrelevant information, the model most likely learned during training to assign low-surprise query and key embeddings to those tokens, because memorising those junk tokens would have hurt its overall ability to predict subsequent tokens (and thus would have increased the training loss).
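Very roughly, the inference-time part looks like the sketch below (a loose paraphrase of the Titans idea, not the authors' implementation; the per-token gates are collapsed into per-chunk scalars and the momentum term on the surprise is dropped):

    import torch
    import torch.nn as nn

    D = 64

    class NeuralMemory(nn.Module):
        def __init__(self):
            super().__init__()
            # "Outer loop" parameters, trained normally: they decide how/what to learn.
            self.key_proj = nn.Linear(D, D, bias=False)
            self.value_proj = nn.Linear(D, D, bias=False)
            self.lr_gate = nn.Linear(D, 1)       # data-dependent step size (theta)
            self.forget_gate = nn.Linear(D, 1)   # data-dependent decay (alpha)
            # "Inner loop" memory MLP, whose weights keep changing at inference time.
            self.memory = nn.Sequential(nn.Linear(D, D), nn.SiLU(), nn.Linear(D, D))

        @torch.no_grad()
        def write(self, x):
            # x: (seq, D) chunk of token representations to memorise.
            k, v = self.key_proj(x), self.value_proj(x)
            theta = torch.sigmoid(self.lr_gate(x)).mean()      # learned to be small for junk
            alpha = torch.sigmoid(self.forget_gate(x)).mean()
            with torch.enable_grad():
                surprise = (self.memory(k) - v).pow(2).mean()  # associative-recall loss
                grads = torch.autograd.grad(surprise, list(self.memory.parameters()))
            for p, g in zip(self.memory.parameters(), grads):
                p.mul_(1 - alpha).sub_(theta * g)              # decay + gated gradient step

        def read(self, q):
            return self.memory(q)

    mem = NeuralMemory()
    mem.write(torch.randn(32, D))      # inference-time "learning" on a chunk
    out = mem.read(torch.randn(4, D))  # retrieval through the updated memory weights

The gates and projections are ordinary trained weights, which is where the "learning how and what to learn" happens: if training pushed theta towards zero for junk-looking tokens, the memory simply doesn't move for them.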
[1] https://x.com/karpathy/status/1938626382248149433