I should clarify that LLMs trained on the internet are necessarily a dead end, theoretically, because the internet both (1) lacks specialist knowledge, as well as knowledge that cannot be encoded in text / language, and (2) is polluted with not just false but irrelevant knowledge for general tasks. LLMs (or rather, transformers and deep models tuned by gradient descent) trained on synthetic data, or on curated / highly specific data where there are actual costs / losses we can properly model (e.g. AlphaFold), could still have tremendous potential. But an "LLM" in the usual, everyday sense in which people use this label is very limited.
A good example would be trying to make an LLM trained on the entire internet do math proofs. Almost everything in its dataset tells it that the word "orthogonal" means "unrelated to", because this is how it is used colloquially. Only in the tiny fraction of math forums / resources it digested does the word actually mean something about the dot product, so clearly an LLM that does math well only does so by ignoring the majority of the space it is trained on. Similar considerations apply to attempting to use e.g. vision-language models trained on "pop" images to facilitate the analysis of, say, MRI scans or LIDAR data. That we can make some progress in these domains tells us there is some substantial overlap in the semantics, but it is obvious there are limits to this.
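To make the contrast concrete: the mathematical sense of "orthogonal" is a precise, checkable property (dot product equals zero), whereas the colloquial sense has no such test. A minimal sketch in pure Python:

```python
# The mathematical sense of "orthogonal" is a concrete test on vectors:
# two vectors are orthogonal iff their dot product is zero. The colloquial
# sense ("unrelated to") has no such check, which is exactly the ambiguity
# an internet-trained model has to resolve from context.

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

u = [1.0, 0.0, 2.0]
v = [0.0, 3.0, 0.0]
x = [1.0, 1.0, 0.0]

print(dot(u, v))  # 0.0 -> u and v are orthogonal in the mathematical sense
print(dot(u, x))  # 1.0 -> u and x are not
```

The point is not that the check is hard, but that the dominant usage in the training distribution points to an entirely different concept.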
There is no reason to believe these (often irrelevant or incorrect) semantics learned from the entire web are going to help the LLM produce deeply useful math / MRI analysis / LIDAR interpretation. Broadly, not all semantics useful in one domain are useful in another, and linguistic semantics in particular have limited relevance to much of what we consider intelligence (which includes visual, auditory, proprioceptive/kinaesthetic, and, arguably, mathematical abstractions). But it could well be that curve-fitting huge amounts of data from the relevant semantic space (e.g. feeding transformers enough Lean / MRI / LIDAR data) is in fact all we need, so that transformers are "good enough" for achieving most basic AI aims. It is just clear that the internet cannot provide that data for all, or even most, domains.
EDIT: Also, Anthropic's writeups are basically fraud if you actually understand the math: there is no "thinking ahead" or "planning in advance" in any sense. If pre-training sends you down certain paths, then of course activations correlated with future tokens are "already visible": this is just what curve-fitting in N dimensions looks like; there is nowhere else for the model to go. Actual thinking ahead means things like backtracking / backspace tokens, i.e. actually retracing your path, which current LLMs simply cannot do.
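To illustrate the distinction being drawn (this is a toy sketch, not any real model's decoder): standard autoregressive decoding can only append tokens, whereas a hypothetical BACKSPACE token would let a model retract what it has already emitted.

```python
# Toy contrast between append-only decoding and a hypothetical
# backtracking decoder. BACKSPACE is an invented illustrative token,
# not a feature of any existing LLM vocabulary.

BACKSPACE = "<bs>"

def decode(token_stream):
    """Replay a stream of emitted tokens, honoring BACKSPACE retractions."""
    out = []
    for tok in token_stream:
        if tok == BACKSPACE:
            if out:
                out.pop()      # retrace: undo the previously emitted token
        else:
            out.append(tok)    # ordinary append-only step
    return out

# The "model" changes its mind about "sat" mid-sequence:
print(decode(["the", "cat", "sat", BACKSPACE, "slept"]))
# -> ['the', 'cat', 'slept']
```

An ordinary autoregressive decoder corresponds to the version of this loop with no BACKSPACE branch at all: every emitted token is final.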
> so clearly an LLM that does math well only does so by ignoring the majority of the space it is trained on
There are probably good reasons why LLMs are not the "ultimate solution", but this argument seems wrong. Humans have to ignore the majority of their "training dataset" in tons of situations, and we seem to do it just fine.
It isn't wrong: think about how weights are updated via (mini-)batches, and how tokenization works, and you will see that LLMs can't ignore poisoning / outliers the way humans do. This would be a classic recent example (https://arxiv.org/abs/2510.07192); IMO it works because the standard (non-robust) loss functions allow for anchor points.
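A minimal sketch of the robustness point (a deliberately stripped-down stand-in for gradient training, not the setup of the linked paper): when fitting a single constant to data, the squared loss is minimized by the mean, which a single poisoned point can drag arbitrarily far, while the absolute loss is minimized by the median, which ignores it.

```python
# Sketch of why a non-robust loss lets one poisoned point act as an "anchor".
# Fitting a constant c to the data:
#   squared loss  -> minimizer is the mean   (pulled hard by outliers)
#   absolute loss -> minimizer is the median (robust to them)

data = [1.0, 1.1, 0.9, 1.0, 100.0]  # one poisoned point among normal values

mean = sum(data) / len(data)             # squared-loss fit
median = sorted(data)[len(data) // 2]    # absolute-loss fit (odd n)

print(mean)    # 20.8 -- dragged far from the bulk of the data
print(median)  # 1.0  -- unaffected by the single outlier
```

Every mini-batch gradient step under a squared-style loss gets a contribution proportional to the outlier's (huge) residual, which is the mechanism by which a small number of poisoned samples can steer the fit.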