Some press coverage (though I highly recommend just reading the paper linked as the OP, it’s quite approachable to skim without prior knowledge, and you get to see how they turn the Star Trek replicator problem into “just” a loss optimization problem with projectors and spinning mirrors!):
And as others have noted, it’s worth bearing in mind that most images here are less than a centimeter in scale; the scale bar is a millimeter. Super impressive stuff.
Minor correction: confusingly, the scale bars vary not just from figure to figure but from image to image within a single figure, as noted in the captions. A rather odd choice, IMO.
There's also a reasonable alignment between Tailwind's original goal (if not an explicit one) of minimizing characters typed, and a goal held by subscription-model coding agents to minimize the number of generated tokens to reach a working solution.
But as much as this makes sense, I miss the days of meaningful class names and standalone (S)CSS. Done well, with BEM and the like, it creates a semantically meaningful "plugin infrastructure" on the frontend, where you write simple CSS scripts to play with tweaks, and those overrides can eventually become code, without needing to target "the second x within the third y of the z."
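The difference is easy to see in a pair of hypothetical selectors (all class names here are invented for illustration):

```css
/* Brittle structural override: "the second x within the third y of the z" */
.checkout > section:nth-child(3) li:nth-child(2) a {
  color: #b00;
}

/* BEM-style semantic hook: survives markup reshuffles and reads as intent */
.checkout__upsell-link--highlighted {
  color: #b00;
}
```

The first breaks the moment someone reorders a section; the second is a stable, meaningful contract between markup and overrides.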
Not to mention that components become more easily scriptable as well. A component running on a production website becomes hackable in the same vein of why this is called Hacker News. And in trying to minimize tokens on greenfield code generation, we've lost that hackability, in a real way.
I'd recommend: tell your AGENTS.md to include meaningful classnames, even if not relevant to styling, in generated code. If you have a configurability system that lets you plug in CSS overrides or custom scripts, make the data from those configurations searchable by the LLM as well. Now you have all the tools you need to make your site deeply customizable, particularly when delivering private-labeled solutions to partners. It's far easier to build this in early, when you have context on the business meaning of every div, rather than later on. Somewhere, a GPU may sigh at generating a few extra tokens, but it's worthwhile.
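As a sketch, the kind of rule I mean might look like this in an AGENTS.md (the wording and the `partner-overrides/` directory name are invented; adapt to your setup):

```markdown
## Styling conventions
- Give every generated component a semantic class name describing its
  business role (e.g. `invoice-summary__total`), even when Tailwind
  utilities handle all of the actual styling.
- Before changing a configurable area, search `partner-overrides/` for
  existing CSS overrides and custom scripts that target its class names.
```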
Art and engineering are both constrained optimization problems - at their core, both involve transforming a loosely defined aesthetic desire into a repeatable methodology!
And if we can call ourselves software engineers, where our day-to-day (mostly) involves less calculus and more creative interpretation of loose ideas, in the context of a corpus of historical texts that we literally call "libraries" - are we not artists and art historians?
We're far closer to Jimi than Roger, in many ways. Pots and kettles :)
That's great, but it doesn't make you or any of us engineers.
Just because I drive my car with immense focus, make precision shifts, and hit the apex of all of my turns when getting onto and off of the freeway doesn't make me a race car driver.
Engineers don't just feel good vibes about science and mix it into their work. It is the core of their work.
Simply having a methodology absolutely is not sufficient for being an engineer.
And great, you have an arbitrary system of ethics, like everyone does I imagine. But no one holds you to these ethics.
The saving grace of Claude Code skills is that when writing them yourself, you can give them frontmatter like "use when mentioning X" that makes them become relevant for very specific "shibboleths" - which you can then use when prompting.
Are we at an ideal balance where Claude Code is pulling things in proactively enough... without bringing in irrelevant skills just because the "vibes" might match in frontmatter? Arguably not. But it's still a powerful system.
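For instance, a skill's frontmatter can spell out its trigger word explicitly. The skill name and the "shiplist" trigger below are made up; the `name`/`description` fields are the standard skill frontmatter:

```markdown
---
name: release-checklist
description: Use when the user mentions "shiplist" or asks to prepare a release
---

1. Bump the version and update the changelog.
2. Run the full test suite before tagging.
```

A distinctive trigger word in `description` acts as the "shibboleth": mention it in a prompt and the skill reliably gets pulled in.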
For manual prompting, I use a "macro"-like system where I can just add `[@mymacro]` in the prompt itself and Claude will know to `./lookup.sh mymacro` to load its definition. Can easily chain multiple together. `[@code-review:3][@pycode]` -> 3x parallel code review, initialize subagents with python-code-guide.md or something. ...Also wrote a parser so it gets reminded by additionalContext in hooks.
Interestingly, I've seen Claude do `./lookup.sh relevant-macro` without any prompting from me. Probably due to it being mentioned in the compaction summary.
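A minimal sketch of the macro-parsing side, assuming the `[@name]` / `[@name:count]` syntax from the comment above (the expansion logic is my guess at how such a parser might work, not the author's actual code):

```python
import re

# Matches [@name] or [@name:count], e.g. [@code-review:3] or [@pycode].
MACRO_RE = re.compile(r"\[@([\w-]+)(?::(\d+))?\]")

def parse_macros(prompt):
    """Return (name, count) pairs for every macro mention in the prompt.

    Each pair could then be expanded by shelling out to `./lookup.sh name`
    and spawning `count` parallel subagents.
    """
    return [(m.group(1), int(m.group(2) or 1)) for m in MACRO_RE.finditer(prompt)]

print(parse_macros("[@code-review:3][@pycode] please review this diff"))
# [('code-review', 3), ('pycode', 1)]
```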
The number of projects accessing OpenAI directly, who might only reach for OpenRouter once an alternative is desired, is unknowable (since OpenAI doesn't share usage statistics), but likely meaningful.
I'd ask the inverse of the question: morally, should a single gatekeeper have the right to deny two consenting parties the ability for one to run the other's software?
Especially when that ability has been established practice and depended upon for decades? And the gate-kept device in question is many users' primary gateway to the modern world?
There's nuance here, of course - I'm not morally obliged to help you run Doom on your Tamagotchi just because you want to do so. But many people around the world rely on an Android device as their only personal computing device (and this is arguably more true for Android than it is for iOS). And to install myself as an arbiter of what code they can and cannot run, with full knowledge that I could at any time be required to leverage that capability at the behest of a government those worldwide users never agreed to be dependent on? That would be a morally fraught system for me to create.
For speculative decoding, wouldn’t this be of limited use for frontier models that don’t have the same tokenizer as Llama 3.1? Or would it be so good that retokenization/bridging would be worth it?
My understanding as well is that speculative decoding only works with a smaller quant of the same model. You're using the faster sampling of the smaller model's representation of the larger model's weights to attempt to accurately predict its token output. This wouldn't work cross-model, as the token probabilities are completely different.
Families of model sizes work great for speculative decoding. Use the 1B with the 32B or whatever.
It's a balance: you want it to guess correctly as much as possible, but also to be as fast as possible. Validation takes time, and every guess needs to be validated.
The model you're using to speculate could be anything, but if it's not guessing what the main model would predict, it's useless.
> The model you're using to speculate could be anything, but if it's not guessing what the main model would predict, it's useless.
So what I said is correct then lol. If you're saying I can use a model that isn't just a smaller quant of the larger model I'm trying to speculatively decode, except that model would never get an accurate prediction, then how is that in any way useful or desirable?
Smaller quant of the same model. A smaller quant of a different family of model would be practically useless and there wouldn't be any point in even setting it up.
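To make the accept/reject dynamic concrete, here's a toy sketch of the verification loop. The "draft" and "target" distributions are made up stand-ins for real model logits, and a real implementation resamples rejections from the residual distribution rather than the plain target, which this simplifies:

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat"]

def draft_dist(context):
    # Hypothetical small model: uniform, so it often disagrees with the target.
    return {t: 1.0 / len(VOCAB) for t in VOCAB}

def target_dist(context):
    # Hypothetical large model: strongly prefers continuing a fixed phrase.
    nxt = VOCAB[len(context) % len(VOCAB)]
    return {t: (0.8 if t == nxt else 0.05) for t in VOCAB}

def speculative_step(context, k=4):
    """Draft k tokens cheaply, then accept each with prob min(1, p_target/p_draft)."""
    proposed, ctx = [], list(context)
    for _ in range(k):
        d = draft_dist(ctx)
        tok = random.choices(list(d), weights=list(d.values()))[0]
        proposed.append(tok)
        ctx.append(tok)
    accepted, ctx = [], list(context)
    for tok in proposed:
        ratio = target_dist(ctx)[tok] / draft_dist(ctx)[tok]
        if random.random() < min(1.0, ratio):
            accepted.append(tok)  # target agrees often enough: guess kept
            ctx.append(tok)
        else:
            # Rejected: fall back to the target model and discard the rest of
            # the draft. Every rejection wastes draft work, which is why a
            # drafter that rarely matches the target buys you nothing.
            t = target_dist(ctx)
            accepted.append(random.choices(list(t), weights=list(t.values()))[0])
            break
    return accepted

random.seed(0)
print(speculative_step(["the"], k=4))
```

The output always follows the target model's distribution; the draft model only decides how often you get multiple tokens per expensive target evaluation. That's why a same-family (same-tokenizer) drafter matters.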
It's even worse than this: the "tasks" that are evaluated are limited to a single markdown file of instructions, plus an opaque verifier (pages 13–14). No problems involving existing codebases, refactors, or anything of the like, where the key constraint is that the "problem definition" in the broadest sense doesn't fit in context.
So when we look at the prompt they gave to have the agent generate its own skills:
> Important: Generate Skills First
>
> Before attempting to solve this task, please follow these steps:
>
> 1. Analyze the task requirements and identify what domain knowledge, APIs, or techniques are needed.
> 2. Write 1–5 modular skill documents that would help solve this task. Each skill should: focus on a specific tool, library, API, or technique; include installation/setup instructions if applicable; provide code examples and usage patterns; be reusable for similar tasks.
> 3. Save each skill as a markdown file in the environment/skills/ directory with a descriptive name.
> 4. Then solve the task using the skills you created as reference.
There's literally nothing it can do by way of "exploration" to populate and distill self-generated skills - not with a web search, not exploring an existing codebase for best practices and key files - only within its own hallucinations around the task description.
It also seems they're not even restarting the session after skills are generated, from that fourth bullet? So it's just regurgitating the context that was used to generate the skills.
So yeah, your empty-codebase vibe coding agent can't just "plan harder" and make itself better. But this is a misleading result for any other context, including the context where you ask for a second feature on that just-vibe-coded codebase with a fresh session.
I don't see how "create an abstraction before attempting to solve the problem" will ever work as a decent prompt when you are not even steering it towards specifics.
If you gave this exact prompt to a senior engineer I would expect them to throw it back and ask wtf you actually want.
If it were in the context of parachuting into a codebase, I’d make these skills an important familiarization exercise: how are tests made, what are patterns I see frequently, what are the most important user flows. By forcing myself to distill that first, I’d be better at writing code that is in keeping with the codebase’s style and overarching/subtle goals. But this makes zero sense in a green-field task.
There's overlap in that with brownfield or legacy code you're strongly opinionated about the status quo, while with greenfield you're strongly opinionated but with fewer constraints.
You have to work with conviction though. It's when you offload everything to the LLM that things start to drift from expectations, because you kept the expectations in your head and away from the prompt.
Do skills extracted from existing codebases cause better or worse code in that they bias the LLM towards existing bad practices? Or, can they assist in acknowledging these practices, and bias it towards actively ensuring they're fixed in new code? How dependent is this on the prompt used for the skill extraction? Are the skills an improvement over just asking to do this extraction at the start of the task?
Now this dynamic would be a good topic to research!
I think it's because AI models have learned that we prefer answers that sound confident, and that we don't want to be pestered with questions before getting an answer.
That is, follow my prompt, and don't bother me about it.
Because if I am coming to an AI Agent to do something, it's because I'd rather be doing something else.
If I already know the problem space very well, we can tailor a skill that will help solve the problem exactly how I already know I want it to be solved.
> limited to a single markdown file of instructions
A single file of instructions is common in most benchmark papers, e.g. Terminal Bench. There are also very complicated prompts, like this one: https://www.skillsbench.ai/tasks/shock-analysis-supply
> opaque verifier
Could you specify which task's verifier is unclear or defective for benchmarking purposes?
That's actually super interesting, and it's why I really don't like the whole .md folder structure, or even any CLAUDE.md. Most of the time it seems you really just want to give it exactly what it needs for best results.
The headline is really bullshit, yes, but I do like the testing.
CLAUDE.md in my projects only has coding / architecture guidelines. Here's what not to do. Here's what you should do. Here are my preferences. Here's where the important things are.
Even though my CLAUDE.md is small, my rules are often ignored. Not always, though, so it's still at least somewhat useful!
What's the hook for switching out of plan mode? I'd like to launch a planning skill whenever Claude writes a plan, but it never picks up the skill, and I haven't found a hook that can force it to.
Man, that’s what I’ve been trying to build the whole time, but I keep getting JSON parsing errors. I’ve debugged a lot, but it seems their Haiku model is not consistent with the actual output. I want a hook that tells them at the end to make sure they’ve built and run the relevant tests. Let me know if you need anything else.
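For what it's worth, a deterministic Stop hook doesn't need Haiku or JSON output at all: per the Claude Code hooks docs, a command that exits with code 2 feeds its stderr back to Claude. Something like this in `.claude/settings.json` (verify the exact schema against the current hooks reference):

```json
{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "echo 'Before finishing: make sure you have built the project and run the relevant tests.' >&2; exit 2"
          }
        ]
      }
    ]
  }
}
```

One caveat: an unconditional block like this can re-trigger on every stop attempt; the hook's JSON input includes a `stop_hook_active` flag precisely so you can guard against loops.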
Even if contractors/intermediaries consider themselves bound by HIPAA, the protections are lighter than one would think, in the political environment we find ourselves in.
Notably (though I'm not a lawyer, and this is not legal advice) - https://www.ecfr.gov/current/title-45/part-164/section-164.5... describing "similar process authorized under law... material to a legitimate law enforcement inquiry" without any notion of scoping this to individuals vs. broad inquiries, seems to give an incredibly broad basis for Palantir to be asked to spin up a dashboard with PII for any query desired for the administration's political agenda. This could happen at any time in the future, with full retroactive data, across entire hospital systems, complete with an order not to reveal the program's existence to the public.
Other tech companies have seen this kind of generalized overreach as both legally risky and destructive to their brand, and have tried to fight this where possible. Palantir, of course, is the paragon of fighting on behalf of citizens, and would absolutely try to... I can't even finish this joke, I'm laughing too hard.
I'm old enough to remember we literally had a Captain America movie, barely more than a decade ago, where the villains turn private PII and health data into targeting lists. (No flying aircraft carriers were injured in the filming of this movie.)
One of the biggest pieces of "writing on the wall" for this IMO was when, in the April 15 2025 Preparedness Framework update, they dropped persuasion/manipulation from their Tracked Categories.
> OpenAI said it will stop assessing its AI models prior to releasing them for the risk that they could persuade or manipulate people, possibly helping to swing elections or create highly effective propaganda campaigns.
> The company said it would now address those risks through its terms of service, restricting the use of its AI models in political campaigns and lobbying, and monitoring how people are using the models once they are released for signs of violations.
To see persuasion/manipulation as simply a multiplier on other invention capabilities, and something that can be patched on a model already in use, is a very specific statement on what AI safety means.
Certainly, an AI that can design weapons of mass destruction could be an existential threat to humanity. But so, too, is a system that subtly manipulates an entire world to lose its ability to perceive reality.
> Certainly, an AI that can design weapons of mass destruction could be an existential threat to humanity. But so, too, is a system that subtly manipulates an entire world to lose its ability to perceive reality.
So, like, social media and adtech?
Judging by how little humanity is preoccupied with global manipulation campaigns via technology we've been using for decades now, there's little chance that this new tech will change that. It can only enable manipulation to grow in scale and effectiveness. The hype and momentum have never been greater, and many people have a lot to gain from it. The people who have seized power using earlier tech are now in a good position to expand their reach and wealth, which they will undoubtedly do.
FWIW I don't think the threats are existential to humanity, although that is certainly possible. It's far more likely that a few people will get very, very rich, many people will be much worse off, and most people will endure and fight their way to get to the top. The world will just be a much shittier place for 99.99% of humanity.
Right on point. That is the true purpose of this 'new' push into AI. Human moderators sometimes realize the censorship they are doing is wrong, and will slow-walk or blatantly ignore censorship orders. AI will diligently delete anything it's told to.
But the real risk is that they can use it to scale up the Cambridge Analytica personality profiles for everyone, and create custom agents for every target that feed them whatever content is needed to manipulate their thinking and ultimately their behavior. AKA MKUltra mind control.
What's frustrating is our society hasn't grappled with how to deal with that kind of psychological attack. People or corporations will find an "edge" that gives them an unbelievable amount of control over someone, to the point that it almost seems magic, like a spell has been cast. See any suicidal cult, or one that causes people to drain their bank account, or one that leads to the largest breach of American intelligence security in history, or one that convinces people to break into the capitol to try to lynch the VP.
Yet even if we prosecute the cult leader, we still hold people entirely responsible for their own actions, and as a society accept none of the responsibility for failing to protect people from these sorts of psychological attacks.
I don't have a solution, I just wish this was studied more from a perspective of justice and sociology. How can we protect people from this? Is it possible to do so in a way that maintains some of the values of free speech and personal freedom that Americans value? After all, all Cambridge Analytica did was "say" very specifically convincing things on a massive, yet targeted, scale.
> manipulates an entire world to lose its ability to perceive reality.
> ability to perceive reality.
I mean, come on.. that's on you.
Not to "victim blame"; the fault lies with the people who deceive. But if you get deceived repeatedly, several times over, and there are people calling out the deception, so you're aware you're being deceived, yet you still choose to be lazy and not learn anything on your own (i.e. do your own research) and just want everything to be "told" to you… that's on you.
Everything you think you "know" is information that was just put in front of you (most of it indirect, much of it several dozen or thousands of layers of indirection deep).
To the extent you have a grasp on reality, it's credit primarily to the information environment you found yourself in and not because you're an extra special intellectual powerhouse.
This is not an insult, but an observation of how brains obviously have to work.
> much of it several dozen or thousands of layers of indirection deep
Assuming we're just talking about information on the internet: What are you reading if the original source is several dozen layers deep? In my experience, it's usually one or two layers deep. If it's more, that's a huge red flag.
Yes, and our own test could very well be flawed as well. Either way, from my experience there usually isn't that sort of massively long chain to get to the original research, more like a lot of people just citing the same original research.
True of academic research which has built systems and conventions specifically to achieve this, but very very little of what we know — even the most deeply academic among us — originates from “research” in the formal sense at all.
The comment thread above is not about how people should verify scientific claims of fact that are discussed in scientific formats. The comment is about a more general epistemic breakdown, 99.9999999% of which is not and cannot practically be “gotten to the bottom of” by pointing to some “original research.”
Your ability to check your information environment against reality is frequently within your control, and it can be used to establish trustworthiness for the things you cannot personally verify. Trusting things you cannot verify is a choice, one you do not have to make, even though it is unfortunately commonly made.
For example, let's take the Uyghur situation in China. I have no ability to check reality there, as I do not live in and have no intention of ever visiting China. My information environment is what the Chinese government reports and what various media outlets and NGOs report.

As it turns out, both the Chinese government and the media and NGOs report on other things that I can check against reality, e.g. events that happen in my country, and I know that they both routinely report falsehoods that do not accord with my observed reality. As a result, I have zero trust in either the Chinese government or the media and NGOs when it comes to things I cannot personally verify, especially when I know both parties have self-interested incentives to report things that are not true.

Therefore, the conclusion is obvious: I do not know and cannot know what is happening around Uyghurs in China, and I do not have a strong opinion on the subject, despite the attempts of various parties to put information in front of me intended to get me to champion their viewpoint. This really does not make me an extra special intellectual powerhouse, one would hope; I'd think this is the bare minimum. The fact that there are many people who do not meet this bare minimum reflects poorly on them rather than highly on me.
On the other hand, I trust what, for instance, the Encyclopedia Britannica has to say about hard science, because in the course of my education I was taught to conduct experiments and confirm reality for myself. I have never once found what is written about hard science in Britannica to not be in accord with my observed reality, and on top of that there is little incentive for the Britannica to print scientific falsehoods that could be easily disproven, so it has earned my trust and I will believe the things written in it even if I have not personally conducted experiments to verify all of it.
Anyone can check their information sources against reality, regardless of their intelligence. It is a choice to believe information that is put in front of you without checking it. Sometimes a choice that is warranted once trust is earned, but all too often a choice that is highly unwarranted.
Why even bother responding to comments if you don't read them?
> because in the course of my education I was taught to conduct experiments and confirm reality for myself. I have never once found what is written about hard science in Britannica to not be in accord with my observed reality,
It's in the same sentence I mentioned Britannica!
> you’re still not checking any facts by yourself
Did you perhaps read it but not understand what my sentence meant because you don't know what an experiment is? Were you not taught to do scientific experiments in your schooling? Literally the entire point of my post is that I do not trust blindly, but choose whom I trust based on their ability to accurately report, without fail, the facts I observe for myself. CNN, like every media outlet I've ever encountered in my entire life, publishes things I can verify to be false. So does some guy on Twitter with 100 million followers. Britannica does not, at least as it pertains to hard science.
I don't necessarily disagree with what you said, but you're not taking a few things into account.
First of all, most people don't think critically, and may not even know how. They consume information provided to them, instinctively trust people they have a social, emotional, or political bond with, are easily persuaded, and rarely question the world around them. This is not surprising, nor a character flaw; it's deeply ingrained in our psyche from birth. Some people learn the skill of critical thinking over time, and are able to do what you said, but this is not common. This ability can even be detrimental if taken too far in the other direction, which is how you get cynicism, misanthropy, conspiracy theories, etc. So it needs to be balanced well to be healthy.
Secondly, psychological manipulation is very effective. We've known this for millennia, but we really understood it in the past century from its military and industrial use. Propaganda and its cousin advertising work very well at large scales precisely because most people are easily persuaded. They don't need to influence everyone, but enough people to buy their product, or to change their thoughts and behavior to align with a particular agenda. So now that we have invented technology that most people can't function without, and made it incredibly addictive, it has become the perfect medium for psyops.
All of these things combined make it extremely difficult for anyone, including skeptics, to get a clear sense of reality. If most of your information sources are corrupt, you need to become an expert information sleuth, and possibly sacrifice modern conveniences and technology for it. Most people, even if capable, are unwilling to make that effort and sacrifice.
https://aminsightasia.com/education/tsinghua-dish-3d-printin...