
Alan Turing's tape machine + neuron model.

In the human brain, neurons store an incredible amount of information. Neuron models in neural networks have only done so with weights.

There is still a lack of understanding of how the human brain does it. DeepMind grabbed a proven memory model from Alan Turing's work and applied it to the feature-barren neuron models in use. Sprinkle magic ...

They are not operating on another level; they're bringing over features that are well documented in the human brain and in papers from a period when people actually thought deeply about this problem, and applying them.

https://en.wikipedia.org/wiki/Bio-inspired_computing

There is no 'intuition' about the architecture. Study the human brain and copy pasta into the computing realm.

Others are doing this as well. If you bother to read the papers people publish, you'll see that many people have presented similar ideas over the years.

You can come up with your own neural Turing machine. Take a featureless neuron model, slap a memory module on it, and you have a neural Turing machine.



In order to use a Turing machine in a neural network - or at least to train it, in any way that isn't impractical and/or cheating - you need to make it differentiable somehow.

Graves and co. have been really creative in overcoming problems in their ongoing program to differentiate ALL the things.


In this context what does differentiable mean?


I think the easiest way to see this is by an example of a non-differentiable architecture.

Let's suppose on the current training input, the network produces some output that is a little wrong. It produced this output by reading a value v at location x of memory.

In other words, output = v = mem[x]

It could be wrong because the value in memory should have been something else. In this case, you can propagate the gradient backwards: whatever the error was at the output is also the error at this memory location.

Or it could be wrong because it read from the wrong memory location. Now you're a bit dead in the water. You have some memory address x, and you want to take the derivative of v with respect to x. But x is the sort of thing that jumps discretely (just as an integer memory address does). You can't wiggle x to see what effect it has on v, which means that you don't know which direction to move x in order to reduce the error.

So (at least in the 2014 paper, ignoring the content-addressed memory), memory accesses don't look like v = mem[x]. They look like v = sum_i(a_i * mem[i]). Any time you read from memory, you're actually reading all the memory, and taking a weighted sum of the memory values. And now you can take derivatives with respect to that weighting.
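The weighted read above can be sketched in a few lines of numpy. This is a toy illustration, not the paper's implementation: the 4-slot memory, the `logits` scores, and the softmax addressing are all assumptions made for the example.

```python
import numpy as np

# Toy 4-slot memory; names `mem`, `logits`, `a` are illustrative only.
mem = np.array([1.0, 2.0, 3.0, 4.0])      # memory values mem[i]
logits = np.array([0.0, 0.0, 2.0, 0.0])   # unnormalized addressing scores

# Softmax turns the scores into a weighting a_i that is positive
# and sums to 1, concentrating on slot 2 here.
a = np.exp(logits) / np.exp(logits).sum()

# The read touches *all* locations and takes a weighted sum,
# instead of the hard lookup v = mem[x].
v = np.sum(a * mem)

# Because v = sum_i(a_i * mem[i]), the partial derivative of v with
# respect to each weight a_i is simply mem[i] -- a smooth quantity,
# so backprop can nudge where the network "looks".
dv_da = mem
```

Contrast this with the hard read `v = mem[x]`: there, changing x by less than a whole step does nothing, so there is no gradient; here, every infinitesimal change to the weighting changes v.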

To me, the question this raises is: what right do we have to call this a Turing machine? This is a very strong departure from Turing machines and digital computers.


Turing didn't specify how reads and writes happened on the tape. For the argument he was making it was clearer to assume there was no noise in the system.

As for "digital" computers, remember that they are built out of noisy physical systems. Any bit in the CPU is actually a range of voltages that we squash into the abstract concept of binary.


I don't think that is really relevant to the discussion. Regardless of how a digital computer is physically implemented, we use it according to specification. We concretize the concept of binary by designing the machine to withstand noise. The thing we get when we choose the digital abstraction is that this is actually realistic: digital computers pretty much operate digitally. Corruption happens, but we consider that an error, and we try to design so that a programmer writing all but the most critical of applications can assume that memory does not get corrupted.

We don't squash the range of voltages. The digital component that interprets that voltage does the squashing. And we design it that way purposefully. https://en.wikipedia.org/wiki/Static_discipline

Turing specified that the reads and the writes are done by heads, which touch a single tape position. You can have multiple (finitely many) tapes and heads, without leaving the class of "Turing machine". But nothing like blending symbols from adjacent locations on the tape, or requiring non-local access to the tape.


No wonder Google built (is building) custom accelerators in hardware. This points to a completely different architecture from Von Neumann, or at least it points to MLPUs, Machine Learning Processing Units.


Pardon my ignorance as I'm not super knowledgeable on this, but is what you described around reading all the memory and taking the weighted sum of values similar in a sense to creating a checksum to compare something against?


I suppose I can see the similarity, in that there's some accumulated value (the sum) from reading some segment of memory, but otherwise I don't think the comparison is helpful.


It means it can be trained by backpropagating the error gradient through the network.

To train a neural network, you want to know how much each component contributed to an error. We do that by propagating the error through each component in reverse, using the partial derivatives of the corresponding function.
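A minimal sketch of that idea with a single scalar weight `w` (all names and values here are made up for illustration):

```python
# One component: y = w * x, trained against a squared-error loss.
w, x, target = 0.5, 2.0, 3.0

y = w * x                      # forward pass through the component
loss = 0.5 * (y - target) ** 2 # how wrong the output is

# Chain rule: dloss/dw = dloss/dy * dy/dw.
dloss_dy = y - target          # error gradient at the component's output
dy_dw = x                      # the component's partial derivative
dloss_dw = dloss_dy * dy_dw    # how much w contributed to the error

w_new = w - 0.1 * dloss_dw     # gradient-descent update
```

In a deep network the same chain-rule step is applied layer by layer in reverse, which is what makes it essential that every component along the path is differentiable.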


Don't forget you first need to understand the mathematical theory of how a brain does computation and pattern recognition. Of course they look into how a brain does it. But the mathematical underpinnings, and how the information flows, are much more important than how an individual neuron works in real life. Abstraction, and applying it to real data, is what they are doing.



