As long as it's new, I tremendously enjoy binge-watching Claude:
I have three tabs open, and if one of them is not doing something interesting I just switch to a different channel, occasionally influencing the narrative.
The article basically claims that LLMs are bad at politics and poker, neither of which is true (at least if they receive some level of reinforcement learning after pre-training).
You are living in the past: these models have been trained on image data for ages, and one interesting finding was that even before that they could model aspects of the visual world astonishingly well, though not perfectly, just through language.
Counterpoint: Try to use an LLM for even the most coarse of visual similarity tasks for something that’s extremely abundant in the corpus.
For instance, say you are a woman with a lookalike celebrity, someone who is a very close match in hair colour, facial structure, skin tone and body proportions. You would like to browse outfits worn by other celebrities (presumably put together by professional stylists) that look exactly like her. You ask an LLM to list celebrities that look like celebrity X, to then look up outfit inspiration.
No matter how long the list, no matter how detailed the prompt in the features that must be matched, no matter how many rounds you do, the results will be completely unusable, because broad language dominates more specific language in the corpus.
The LLM cannot adequately model these facets, because language is in practice too imprecise, as currently used by people.
To dissect just one such facet: the LLM response will list dozens of people who may share a broad category (red hair), with complete disregard for the exact shade of red, whether the hair is dyed, and whether it is in fact natural hair or a wig.
The number of listicles clustering these actresses together as redheads will dominate anything with more specific qualifiers, like 'strawberry blonde' (which in general counts as red hair), 'undyed hair' (which in fact tends to increase the proportion of dyed-hair results, because that's how linguistic vector similarity works sometimes) and 'natural' (which again seems to translate into 'the most natural-looking unnatural', because that's how language tends to be used).
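A minimal sketch of that last point, assuming the sentence-transformers package and the all-MiniLM-L6-v2 checkpoint (both choices are mine, not from the comment above, and the phrases are made up): the negating qualifiers typically barely move the embeddings apart, which is why retrieval keyed on this kind of similarity returns dyed-hair results for an "undyed" query.

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    phrases = [
        "actress with natural, undyed strawberry blonde hair",
        "actress with dyed red hair",
        "redhead actress",
    ]
    emb = model.encode(phrases, convert_to_tensor=True)

    # Pairwise cosine similarities: the qualifiers "undyed" and "natural"
    # separate the first phrase from the second far less than the broad
    # category ("red hair") pulls them together.
    print(util.cos_sim(emb, emb))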
You've clearly never read an actual paper on the models and understand nothing about backbones, pre-training, or anything I've said in my posts in this thread. I've made claims far more specific about the directionality of information flow in Large Multimodal Models, and here you are just providing generic abstract claims far too vague to address any of that. Are you using AI for these posts?
They model the part of the world that (linguistic models of the world posted on the internet) try to model. But what is posted on the internet is not IRL. So, to be glib: LLMs trained on the internet do not model IRL, they model talking about IRL.
His point is that human language and the written record is a model of the world, so if you train an LLM you're training a model of a model of the world.
That sounds highly technical if you ask me. People complain if you recompress music or images with lossy codecs, but when an LLM does that suddenly it's religious?
An LLM has an internal linguistic model (i.e. it knows token patterns), and that linguistic model models humans' linguistic models (a stream of tokens) of their actual world models (which involve far, far more than linguistics and tokens, such as logical relations beyond mere semantic relations, sensory representations like imagery and sounds, and, yes, words and concepts).
So LLMs are linguistic (token pattern) models of linguistic models (streams of tokens) describing world models (more than tokens).
It thus does not in fact follow that LLMs model the world (as they are missing everything that is not encoded in linguistic semantics).
At this point, anyone claiming that LLMs are "just" language models isn't arguing in good faith. LLMs are a general-purpose computing paradigm. LLMs are circuit builders: the converged parameters define pathways through the architecture that pick out specific programs. Or, as Karpathy puts it, LLMs are a differentiable computer[1]. Training LLMs discovers programs that reproduce the input sequence well. Tokens can represent anything, not just words. Roughly the same architecture can generate passable images, music, or even video.
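To make the "tokens can represent anything" point concrete, here is a rough sketch (PyTorch assumed, all names hypothetical, not anyone's actual model): nothing in the next-token setup cares whether the integers index word pieces, quantized audio codes, or image-patch codebook entries; only the vocabulary size changes.

    import torch
    import torch.nn as nn

    class TinyDecoder(nn.Module):
        """Generic next-token predictor over an arbitrary integer vocabulary."""
        def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=4):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, n_layers)
            self.head = nn.Linear(d_model, vocab_size)

        def forward(self, tokens):
            # Causal mask: each position only attends to earlier positions.
            mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
            return self.head(self.blocks(self.embed(tokens), mask=mask))

    # The architecture is indifferent to what the integers denote.
    text_model  = TinyDecoder(vocab_size=50_000)  # word-piece IDs
    audio_model = TinyDecoder(vocab_size=1_024)   # quantized codec tokens
    codes = torch.randint(0, 1_024, (2, 128))
    logits = audio_model(codes)                   # shape (2, 128, 1024)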
No, it's extremely silly to use the incidental name of a thing as an argument for the limits of its relevance. LLMs were designed to model language, but that does not determine the range of their applicability, or even the class of problems they are most suited for. It turns out that LLMs are a general computing architecture. What they were originally designed for is incidental. Any argument that starts off "but they are language models" is specious out of the gate.
Sorry, but using "LLM" when you mean "AI" is a basic failure to understand simple definitions, and also is ignoring the meat of the blog post and much of the discussion here (which is that LLMs are limited by virtue of being only / mostly trained on language).
Everything you are saying is either incoherent because you actually mean "AI" or "transformer", or is just plain wrong, since not all problems can be solved using e.g. single-channel, recursively-applied transformers, as I mention elsewhere here: https://news.ycombinator.com/item?id=46948612. The design of LLMs absolutely determines the range of their applicability, and the class of problems they are most suited for. This isn't even a controversial take; lots of influencers and certainly most serious researchers recognize the fundamental limitations of the LLM approach to AI.
You literally have no idea what you are talking about and clearly do not read or understand any actual papers where these models are developed, and are just repeating simplistic metaphors from blog posts, and buying into marketing.
In this case this is not so. The primary model is not a model at all, and the surrogate has bias added to it. It's also missing any way to actually check the internal consistency of statements or otherwise combine information from its corpus, so it fails as a world model.
You are correct: the token representation gets abstracted away very quickly and is then identical for textual and image models. It's the so-called latent space, and people who focus on next-token prediction completely miss the point that all the interesting thinking takes place in abstract world-model space.
> You are correct: the token representation gets abstracted away very quickly and is then identical for textual and image models.
This is mostly incorrect, unless you mean "they both become tensor / vector representations (embeddings)". But these vector representations are not comparable.
E.g. if you have a VLM with a frozen dual-backbone architecture (say, a vision transformer encoder trained on images and an LLM backbone pre-trained in the usual way), then even if you design the architecture so that the embedding vectors produced by each encoder have the same shape, to be combined by another component such as a unified transformer, it will not be the case that the cosine similarity between an image embedding and a text embedding is a meaningful quantity (it will just be random noise). The representations from each backbone are not identical, and the semantic structure of each space is almost certainly very different.
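A minimal sketch of that claim, with PyTorch and hypothetical stand-ins for the two frozen backbones: projections give both embeddings the same shape, but without any alignment training (e.g. CLIP-style contrastive learning) the cosine similarity between them is noise, not a match score.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    torch.manual_seed(0)

    # Hypothetical stand-ins for a frozen ViT backbone (768-d pooled output)
    # and a frozen LLM backbone (4096-d pooled output).
    vision_backbone = nn.Linear(3 * 224 * 224, 768).requires_grad_(False)
    text_backbone = nn.Embedding(32_000, 4096).requires_grad_(False)

    # Untrained projections into a shared 512-d space: the shapes now match.
    vision_proj = nn.Linear(768, 512)
    text_proj = nn.Linear(4096, 512)

    image = torch.rand(1, 3 * 224 * 224)
    tokens = torch.randint(0, 32_000, (1, 8))

    v = vision_proj(vision_backbone(image))           # (1, 512)
    t = text_proj(text_backbone(tokens).mean(dim=1))  # (1, 512)

    # Same shape, but without alignment training this number says nothing
    # about whether the "caption" matches the "image".
    print(F.cosine_similarity(v, t, dim=-1))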
Someone might build enough hype around it to challenge the oligopoly.
As for training on existing browsers: they are trained on the whole corpus of human thought and can bring insights from many other fields into the browser. They can write a browser in a completely new language, not just transpiling but building it from first principles, or, as Karpathy calls it, spec-based programming.
This is the whole message of the hype: that you can churn out 500 commits a day relatively confidently, the way you have Clang churn out 500 assemblies without reading them. We might not be 100% there, but the hype is looking slightly into the future, and even though I don't see the difference from Claude Code, I tend to agree that this is the new way to do things; even if something breaks, on average it's safe enough.