
I am not convinced by this argument. It is very misleading to think that, since GPT is trained on data from the world, it must, necessarily, always produce an average of the ideas in the world. Humans have formulated laws of physics that "minimize loss" on our predictions of the physical world and that are later experimentally determined to be accurate, and there's no reason to assume a language model trained to minimize loss on language won't be able to derive similar "laws" that simulate human behavior.

In short, GPT doesn't just estimate text by looking at frequencies. GPT works so well by learning to model the underlying processes (goal-directedness, creativity, what have you) that create the training data. In other words, as it gets better (and my claim is it has already gotten to the point where it can do the above), it will be able to harness the same capabilities that humans have to make something "not in the training set".

Check out https://generative.ink/posts/simulators/ for a better treatment of this topic than I could possibly give.

Here's a relevant section of said article:

> Guessing the right theory of physics is equivalent to minimizing predictive loss. Any uncertainty that cannot be reduced by more observation or more thinking is irreducible stochasticity in the laws of physics themselves – or, equivalently, noise from the influence of hidden variables that are fundamentally unknowable.

> If you’ve guessed the laws of physics, you now have the ability to compute probabilistic simulations of situations that evolve according to those laws, starting from any conditions. This applies even if you’ve guessed the wrong laws; your simulation will just systematically diverge from reality.

> Models trained with the strict simulation objective are directly incentivized to reverse-engineer the (semantic) physics of the training distribution, and consequently, to propagate simulations whose dynamical evolution is indistinguishable from that of training samples. I propose this as a description of the archetype targeted by self-supervised predictive learning, again in contrast to RL’s archetype of an agent optimized to maximize free parameters (such as action-trajectories) relative to a reward function.
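For concreteness, the "predictive loss" in these excerpts is just next-token cross-entropy. A toy sketch of that objective (placeholder sizes, and an embed-and-project stand-in rather than any real LLM):

    # Toy sketch of the self-supervised predictive objective: predict token t+1 from
    # tokens up to t, minimizing cross-entropy against what the training text says next.
    import torch
    import torch.nn.functional as F

    vocab_size, seq_len, d_model = 1000, 16, 64
    tokens = torch.randint(0, vocab_size, (1, seq_len))       # stand-in training sample

    embed = torch.nn.Embedding(vocab_size, d_model)
    lm_head = torch.nn.Linear(d_model, vocab_size)             # toy "model": embed then project

    logits = lm_head(embed(tokens[:, :-1]))                    # predictions for positions 1..T-1
    loss = F.cross_entropy(logits.reshape(-1, vocab_size),     # the predictive loss being minimized
                           tokens[:, 1:].reshape(-1))
    loss.backward()                                            # gradients push toward better prediction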



Even very simple and small neural networks that you can easily train and play with on your laptop readily show that this “outputs are just the average of inputs” conception is just wrong. And it’s not wrong in some subtle philosophical sense, it’s wrong in a very clear mathematical sense, as wrong as 2+2=5.

One example that’s been used for something like 15+ years is using the MNIST handwritten digits dataset to recognize and then reproduce the appearances of handwritten digits. To do this, the model finds regularities and similarities in the shapes of digits and learns to express the digits as combinations of primitive shapes. The model will be able to produce 9s or 4s that don’t quite look like any other 9 or 4 in the dataset. It will also be able to produce a digit that looks like a weird combination of a 9 and a 2, if you decode the right point in the latent space.

It’s simply mathematically naive to call this new 9-2 hybrid an “average” of a 9 and a 2. If you averaged the pixels of a 9 image and a 2 image, you would get an ugly nonsense image. Interpolation in the latent space instead finds something like a mix between the ideas behind the shape of 9s and the shape of 2s. The model was never shown a 9-2 hybrid during training, but its 9-2 will look a lot like what you would draw if you were asked to draw a 9-2 hybrid.
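To make the latent-space interpolation concrete, here is a minimal sketch — a toy autoencoder rather than any particular published model, with illustrative sizes — of decoding points along the line between the latent code of a 9 and that of a 2:

    # Toy autoencoder sketch: interpolate in latent space between a "9" and a "2".
    # Architecture, sizes, and variable names are illustrative assumptions, not a reference model.
    import torch
    import torch.nn as nn

    encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 16))
    decoder = nn.Sequential(nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, 784), nn.Sigmoid())

    # ...assume encoder/decoder have been trained on MNIST with a reconstruction loss...

    def interpolate(img_nine, img_two, steps=8):
        """Decode points on the line between two latent codes (inputs are 1x28x28 tensors).

        Averaging the two images pixel-wise gives an overlapping smudge; decoding the
        latent midpoint gives a plausible 9-2 hybrid shape instead.
        """
        z9, z2 = encoder(img_nine), encoder(img_two)
        alphas = torch.linspace(0, 1, steps).view(-1, 1)
        return decoder((1 - alphas) * z9 + alphas * z2).view(-1, 28, 28)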

A big LLM is something like 10 orders of magnitude bigger than your MNIST model, and the interpolations between concepts it can make are obviously more nuanced than interpolations in latent space between 9 and 2. If you tell it to write about “hubristic trout” it will have no trouble at all putting those two concepts together, as easily as the MNIST model produced a 9-2 shape, even though it has never seen an example of a “hubristic trout.”

It is weird because all of the above is obvious if you’ve played with any NN architecture much, but seems almost impossible to grasp for a large fraction of people, who will continue to insist that the interpolation in latent space that I just described is what they mean by “averaging”. Perhaps they actually don’t understand how the nonlinearities in the model architecture give rise to the particular mathematical features that make NNs useful and “smart”. Perhaps they see something magical about cognition and don’t realize that we are only ever “interpolating”. I don’t know where the disconnect is.


I think a partial explanation is that people don't move away from parametric representations of reality. We simply must be organized into a nice, neat Gaussian distribution with easy-to-calculate means and standard deviations. The idea that the organization of data could be relational, or better handled by a decision tree or whatever, is not really presented to most people in school or university. Certainly not as frequently or holistically as the simple idea that the average represents the middle of a distribution.

You see this across the social sciences, where many fields have papers coming out every decade or so since the 1980s saying that linear regression models are wrong because they don't take into account concepts such as hierarchy (e.g., students go to different schools), frailty (there are likely unmeasured reasons why some people do the things they do), latent effects (there are likely non-linear processes that are more than the sum of the observations, e.g., traffic flows like a fluid and can have turbulence), auto-correlation/spatial correlation/etc.

In fact, I would argue that a decision-tree-based model (e.g., gradient-boosted trees) will always arrive at a better model of a human system than any linear regression. But at this point I suppose I have digressed from the original point.
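As a rough illustration of that comparison (synthetic non-linear data via scikit-learn; this shows the tendency, it doesn't prove "always"):

    # Gradient-boosted trees vs. linear regression on a synthetic non-linear target.
    # The dataset and settings are illustrative choices, not a benchmark.
    from sklearn.datasets import make_friedman1
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_friedman1(n_samples=2000, noise=0.5, random_state=0)  # non-linear + interactions

    for model in (LinearRegression(), GradientBoostingRegressor(random_state=0)):
        r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
        print(type(model).__name__, round(r2, 3))  # the boosted trees typically score higher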


I confess to the mirror image of the same issue. I cannot understand why people insist that regressing in a latent space, derived from the mere associative structure of a dataset, ought to be given some noble status.

It is not a model of our intelligence. It's a stupid thing. You can go and learn about animal intelligence; merging template cases of what's gone before, as recorded by human social detritus, doesn't even bear mentioning by comparison.

The latent space of all the text tokens on the internet is not a model of the world; and finding a midpoint is just a trick. It's a merging between "stuff we find meaningful over here", and "stuff we find meaningful over there" to produce "stuff we find meaningful" -- without ever having to know what any of it meant.

The trick is that we're the audience, so we'll find the output meaningful regardless. Image generators don't "struggle with hands"; they "struggle" with everything -- it is we, the observers, who care more about the fidelity of hands. The process of generating pixels is uniformly dumb.

I don't see anything more here than "this is the thing that I know!" therefore "this is a model of intelligence!11.11!01!!" .

It's a very very bad model of intelligence. The datasets involved are egregious proxy measures of the world whose distribution has little to do with it: novels, books, pdfs, etc.

This is very far away from the toddler who learns to walk, learns to write, and writes what they are thinking about. They write about their day, say -- not because they "interpolate" between all books ever written... but because they have an interior representational life which is directly caused by their environment and can be communicated.

Patterns in our communication are not models of this process. They're a dumb light show.


I feel like our positions are probably both buried in webs of mutually-difficult-to-communicate worldview assumptions, but for what it’s worth, I care more at this point about the models being useful than being meaningful. I use GPT-4 to do complex coding and copy editing tasks. In both cases, the model understands what I’m going for. As in, I have some specific, complex, nuanced concept or idea that I want to express, either in text or in code, and it does that. This can’t be me “projecting meaning” onto the completions, because the code works and does what I said I wanted. You can call this a light show, but you can’t make it not useful.


> because the code works

The output of these systems can have arbitrary properties.

Consider an actor in a film: their speech has the apparent property, say, of "being abusive to their wife" -- but the actor isn't abusive, and has no wife.

Consider a young child reading from a chemistry textbook, their speech has apparent property "being true about chemistry".

But a professor of chemistry who tells you something about a reaction they've just performed, explains how it works, etc. -- this person might say identical words to the child, or the AI.

But the reason they say those words is radically different.

AI is a "light show" in the same way a film is: the projected image-and-sound appears to have all sorts of properties to an audience. Just as the child appears an expert in chemistry.

But these aren't actual properties of the system: the child, the machine, the actors.

This doesn't matter if all you want is an audiobook of a chemistry textbook, to watch a film, or to run some generated code.

But it does matter in a wide variety of other cases. You cannot rely on apparent properties when, for example, you need the system to be responsive to the world as-it-exists unrepresented in its training data. Responsive to your reasons, and those of other people. Responsive to the ways the world might be.

At this point the light show will keep appearing to work in some well-trodden cases, but will fail catastrophically in others -- for no reason a fooled audience will be able to predict.

But predicting it is easy -- as you'll see, over the next year or two, ChatGPT's flaws will become more widely known. There are many papers on this already.


>> I feel like our positions are probably both buried in webs of mutually-difficult-to-communicate worldview assumptions, but for what it’s worth, I care more at this point about the models being useful than being meaningful.

The question is how useful they are. With LLMs it seems they can be useful as long as you ask them to do something that a human, or another machine (like a compiler), can verify -- like your example of synthesising a program that satisfies your specification and compiles.

Where LLMs will be useless is in tasks where we can't verify their output. For example, I don't hear anyone trying to get GPT-4 to decode Linear A. That would be a task of significant scientific value, and one that a human cannot perform -- unlike generating text or code, which humans can already do pretty damn well on their own.


>> Guessing the right theory of physics is equivalent to minimizing predictive loss.

A model can reduce predictive loss to almost zero while still not being "the right theory" of physics, or anything else. That is a major problem in science, and machine learning approaches don't have any answer to it. Machine learning approaches can be used to build more powerful predictive models, with lower error, but nothing tells us that one such model is, or even isn't, "the right theory".

As a very famous example, or at least the one I hold as a classic, consider the theory of epicyclical motion of the planets [1]. This was the commonly accepted model of the motion of the observable planets for thousands of years. It persisted because it had great predictive accuracy. I believe alternative models were proposed over the years, but all were shot down because they did not approach the accuracy of the theory of epicycles. Even Copernicus' model, that is considered a great advance because it put the Sun in the center of the universe, continued to use epicycles and so did not essentially change the "standard" model. Eventually, Kepler came along, and then Newton, and now we know why the planets seem to "double back" on themselves. And not only that, but we can now make much better predictions than we ever could do with the epicyclical model, because now we have an explanatory model, a realist model, not just an instrumentalist model, and it's a model not just of the observable motion of the planets but a model of how the entire world works.

As a side point, my concern with neural nets is that we get "stuck in a rut" with them, because of their predictive power, like we got stuck with the epicyclical model, and that we spend the next thousand years or so in a rut. That would be a disaster, at this point in our history. Right now we need models that can do much more than predict; we need models that are theories, that explain the world in terms of other theories. We need more science, not more modelling.

_________

[1] https://en.wikipedia.org/wiki/Deferent_and_epicycle


> Guessing the right theory of physics is equivalent to minimising predictive loss.

No it's not. It's minimising "predictive loss" only under extreme non-statistical conditions imposed on the data.

The world itself can be measured an infinite number of ways. There are an infinite number of irrelevant measures. There are an infinite number of low-reliability relevant measures. And so on.

Yes, you can formulate the extremely narrow task of modelling "exactly the right dataset" as loss minimization.

But you cannot model the production of that dataset this way. Data is a product of experiments.


This is just you declaring "no you can't" without supporting that in any way.

How is a theory of physics not a loss minimisation process? The history of science is literally described in these terms, e.g. the Bohr model of the atom is wrong, but also so useful that we still use it to describe NMR spectroscopy.

Why did we come up with it? Because there aren't infinite ways to measure the universe; there are in fact very limited ways, defined by our technology. Good theories -- the ones that minimise loss well -- generally then let us build better technology to find more data.

You're invoking infinities which don't exist as a handwave for "understanding is a unique part of humanity" to try and hide that this is all metaphysical special pleading.


Alright...

What loss was being minimised to find F=GMm/r^2? Or any law of physics you like.


Gravitation was literally about predicting future positions of the stars, and was successful because it did so much better than any geocentric model. How is that not a loss minimization activity?

And before we had it, epicycles were steadily increasing in complexity to explain every new local astronomical observation, but that model was popular because it gives a very efficient initial fit of the easiest data to obtain (i.e. the Moon actually does go around the Earth, and with only one reference point the Sun appears to go round the Earth too). But of course once you have a heliocentric theory, you can throw away all those parameters, and every new prediction lines up nearly perfectly (accounting for how much longer it would take before we had precise enough orbital measurements to need Relativity to fully model it).


When the law of gravitation was formulated, it could not in fact be used to predict orbits reliably (Kepler's ellipses are the solution to the two-body problem anyway, and for a more complex system, integration was impossible to any useful precision at the time), and Kepler's theories came out long before it did.

It took more than 70 years after its formulation for the law to actually be tested against observations in a conclusive manner.


Also note that Copernicus' heliocentric model retained the geocentric model's epicycles on circular orbits. It really took Kepler to make a better model. And it was better because it was explanatory to boot, and not only predictive.

At some point, the metaphor of "loss minimisation" starts to break down. When we're talking about science, there's much more we want to do than minimise some loss function -- one that nobody has ever written down anyway. We want to be able to say "this is how the world works". The language of function optimisation is simply not the right language to do anything like that.

Even Vladimir Vapnik turned to poetry to try and increase the information available to statistical learners. Let me see if I can find that paper...


Sure but it was a better fit, and before that heliocentric models were definitely the only way forward that didn't keep adding terms every time someone spotted a moon.

Occam's razor - do not multiply terms without necessity - is essentially a loss function.
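A rough sketch of that reading of Occam's razor: an AIC-style complexity penalty is one standard way to write it down (the residuals and parameter counts below are made up purely for illustration):

    # Occam's razor as a penalty term: misfit plus a cost per free parameter (AIC-like).
    import numpy as np

    def penalized_loss(residuals, n_params, penalty=2.0):
        """Sum of squared residuals plus a cost for each free parameter."""
        return np.sum(np.square(residuals)) + penalty * n_params

    simple = penalized_loss(np.array([0.30, -0.20, 0.25]), n_params=2)     # few terms, decent fit
    complex_ = penalized_loss(np.array([0.28, -0.19, 0.24]), n_params=12)  # many terms, barely better fit
    print(simple < complex_)  # True: the extra terms were not "necessary"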


You're talking about Kepler's model here, not about the gravitational equation. The gravitational equation was not a better fit than Kepler at that time, especially since it used unknown constants.


So would you care to comment on how this relates to the original contention, which is the claim that a loss function could not discover Newton's law of gravitation?

Because what you're arguing, extensively, is that due to lack of fit, Newton's Law of Gravitation wasn't settled science until observational data was of sufficient fidelity to clearly distinguish it.

Which sure sounds like a loss function.


Formulate the loss function -- you'll find it's just

    loss(the-right-answer(perfect-x) - perfect-y)
The most important aspect of "the-right-answer" is its ability to ignore almost all the data.

The existence of planets is "predictable" from the difference between the data and the theory -- if the theory is just a model of the data, it has no capacity to do this.

If you want to "do physics" by brute force optimization you'd need to have all possible measures, all possible data, and then a way of selecting relevant causal structures in that data -- and then able to try every possible model.

    loss(Model(all-data | relevant-causal-structures) - Filter(...|...)) forall Model
Of course, (1) this is trivially not computable (eqv. to computing the reals) -- (2) "all possible data with all possible measures" doesn't exist and (3) selecting relevant causal structure requires having a primitive theory not derived from this very process

Animals solve this in reverse order: (3) is provided by the body's causal structure; (2) is obtained by using the body to experiment; and (1) we imagine simulated ways-the-world-might-be to reduce the search space down to a finite size.

I.e., we DO NOT make theories out of data. We first make theories, then use the data to select between them.

This is necessary, since a model of the data (i.e., modern AI, i.e., automated statistics, etc.) doesn't decide between an infinite number of theories of how the data came to be.


> I.e., we DO NOT make theories out of data. We first make theories, then use the data to select between them.

No we don't, we make hypotheses and then test them. Hypotheses are based on data.

There are physics experiments being done right now where the exact hope is that existing theory has not predicted the result they produce, because then we'd have data to hypothesise something new. [1]

You are literally describing what deep learning techniques are designed to do while claiming they can't possibly do it.

[1] https://www.scientificamerican.com/article/measurement-shows...


Hypotheses are "based" on data in the sense that via imagination we simulate ways the world might be, and then "data" is a clue to a contradiction.

Deep learning models are data: they are just associations between points.

Train a NN on data generated from an exponential function, and the model produced is not exponential.

Train a NN on the covid pandemic, and you will never obtain the SIR model.

AI is just associative statistical modelling. The model is the data.
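To illustrate the exponential example above, here is a minimal sketch (toy setup, illustrative only) of fitting a small network to y = exp(x) on a narrow range and then asking it to extrapolate:

    # Fit a small MLP to y = exp(x) on x in [0, 2], then query x = 4.
    # Inside the training range it fits well; outside it, the learned function
    # extrapolates roughly piecewise-linearly -- it is not exponential.
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    x_train = np.linspace(0, 2, 200).reshape(-1, 1)
    y_train = np.exp(x_train).ravel()

    net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=5000, random_state=0)
    net.fit(x_train, y_train)

    print(net.predict([[1.0]]), np.exp(1.0))   # close: interpolation within the data
    print(net.predict([[4.0]]), np.exp(4.0))   # far off: the model learned the data, not the law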


I know this discussion is a bit old at this point, but I came across this essay [1] for the first time today, and it shows more of what I was trying to get across earlier in the thread. Hopefully you'll find it interesting. Essentially, they trained a GPT on predicting the next move in a game of Othello, and by analyzing the weights of the network, found that the weights encode an understanding of the game state. Specifically, given an input list of moves, it calculates the positions of its own pieces and those of the opponent (a tricky task for a NN, given that Othello pieces can swap sides based on moves made on the other side of the board). Doing this allowed it to minimize loss. By analogy, it formed a theory about what makes moves legal in Othello (in this case, the positions of each player's pieces), and found out how to calculate those in order to better predict the next move.

[1] https://www.neelnanda.io/mechanistic-interpretability/othell...
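To give a flavour of the probing technique described in [1] (the names, sizes, and dummy data below are placeholders, not the actual Othello-GPT code): take hidden activations after a move sequence and train a small probe to read off the board.

    # Sketch of a linear probe: can the board state be decoded from the activations?
    import torch
    import torch.nn as nn

    d_model, n_squares, n_states = 512, 64, 3            # each square: empty / mine / theirs
    probe = nn.Linear(d_model, n_squares * n_states)

    def probe_loss(hidden, board_labels):
        """Cross-entropy between probed square states and the true board.

        hidden:       (batch, d_model) activations at some layer after a move sequence
        board_labels: (batch, n_squares) integer state of each square
        """
        logits = probe(hidden).view(-1, n_squares, n_states)
        return nn.functional.cross_entropy(logits.reshape(-1, n_states),
                                           board_labels.reshape(-1))

    # Dummy stand-ins; in the real analysis these come from the trained Othello model.
    hidden = torch.randn(8, d_model)
    board = torch.randint(0, n_states, (8, n_squares))
    print(probe_loss(hidden, board))

If a probe like this reaches high accuracy on held-out games, the activations encode the board state, i.e. the network computed more than surface statistics of the move text.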


Proving that any given AI architecture can't do something doesn't prove that all AI architectures, forever, will never be able to do it. Neural networks aren't all of AI; they're not even "neural networks", since the term wraps up a huge amount of architectural and design choices and algorithms.

Unless you believe in the soul, the human brain is just a very complicated learning architecture with a specific structure (which we freely admit doesn't operate like existing systems... sort of; of course, we also don't know that it isn't just a convoluted biological path to emulating them for specific subsystems).

But even your original argument is focused on just playing with words to remove meaning: calling something data doesn't meaningfully make your point, because mathematical symbols are just "data" as well.

Mathematics has no requirement to follow any laws you think it does -- 1 + 1 can mean whatever we want, and it's a topic of discussion why mathematics describes the physical world at all -- which is to say, it's valid to say we designed mathematics to follow observed physics.


The whole point is that Newton came up with the law before there was observational data that could prove it, which is fundamentally different from regression. The data is used to reject the theory, not to form it, here.


I get the feeling that the OP is using "loss function" in the figurative sense, and not in the sense of an actual loss function that is fit to observations. We know nobody did that in Newton's time. In Newton's time they didn't even have the least squares method, let alone fit a model to observations by optimising a loss function.


Yes, I'm also using it in the figurative sense. It's not a regression model: the models are developed and then the data is sought out to test them. It's the reverse for a regression technique. The model being generated before the data that can support it is a big part of how humans come up with these models, and it's fundamentally different in many ways.


What are you talking about? If scientific models aren't developed based on data, then what are they developed based on? Divine inspiration?

No. Very obviously no. The multi-post diversion about Kepler's laws is explicitly evidence to the contrary since Kepler's laws are a curve fitting exercise which matches astronomical data in a specific context but doesn't properly describe the underlying process - i.e. their predictive power vanishes once the context changes. But they do simplify down to Newton's Law once the context is understood.

New data is sought out for models to determine whether they are correct, because a correct model has to explain existing data and predict future data. The Bohr Model of the atom was developed because it explained the emission spectra of hydrogen well. It's not correct, because it doesn't work for anything but hydrogen... but it's actually correct enough that if you're doing nuclear magnetic resonance (which is very hydrogen-centric for organic molecules) then it is in fact good enough to predict and understand spectra with (at least in 1D; 3D protein structure prediction is its own crazy thing).

This is the entire point of deep learning techniques. The whole idea of latent space representations is that they learn underlying structural content of the data which should include observations about reality.


That's not how the scientific process works. You use your intuition to make a theory, sometimes loosely based on data, and then you come up with an experiment to test it.

We both agree that Kepler was trying to fit curves. But that's not what Newton was trying to do. Newton was trying to explain. Newton's model did not fit the data better than Kepler's model until far after they both died.

Newton's model, to Newton, had more loss than Kepler's model.

But it turned out 70 years later that Newton's model was better, because it's only then that there was any data for which it was a better prediction.

You're similarly wrong about Bohr. If all you were interested in was finding the emission spectrum of hydrogen, there's absolutely no reason you'd try to come up with the Bohr model. Why? Because Rydberg had already made a formula that predicted the emission spectrum of hydrogen, 25 years earlier.

The entire point of Bohr's model and of Newton's model is that they weren't empirically better at predicting the phenomena. Indeed, simple curve fitting came up with equations that are far better in practice, earlier.

But they were better at explaining the phenomena.

And that only became relevant because after we had these models, we came up with new experiments, informed by these models, which helped us understand them and eventually push them beyond the breaking point.

It's not a curve fitting experiment. We already had better curve fitting models far before either of those was invented. If your goal was to reduce the loss, they'd be useless and there would be no point coming up with them.

That's the difference between the scientific method and mere regression.


> That's not how the scientific process works. You use your intuition to make a theory,

Go ahead and define what "intuition" is. Why do people have it? Why is some people's intuition better than others'?


(Not the OP) We don't know how the human mind works, or how "intuitions" or "inspiration" come about, but that's no reason to call them "metaphysics". Clearly, they are physical processes that somehow take place in the human brain.

The questions you ask in this comment are good questions, for which we have no good answers. That doesn't mean there's anything supernatural going on, or that anyone is assuming something supernatural is happening. We just don't know how human scientists come up with new hypotheses, that's all there is to it.

But it's not like there's some kind of principled way to do it. There are no formulae, no laws, where we can plug in some data and out pops a hypothesis ready for the testing. Maybe we will find how to define such laws or formulae at some point, but for now, all we've got is some scientist waking up one day going "holy cow, that's it!". And then spending the next ten years trying to show that's what it really is.


To clarify, the OP is pointing out that it wasn't Newton's law of universal gravitation that defeated the epicyclical model of the cosmos.

It was Kepler's laws of planetary motion that did for epicycles; and that happened 70-ish years before Newton stated his laws of motion and pointed out that they basically subsume Kepler's laws of planetary motion.




