This is the correct answer. A key reason REBCO is so much better than older alternatives is that it can be cooled to superconductivity using only liquid nitrogen (77 K), as opposed to liquid helium, which is much harder to work with.
Nit: ultimately the new magnets need to be stronger for the same volume, no matter how easy the coolant is to work with. The simpler cooling is just a nice bonus.
Longtime DVC user here - this is going to be so helpful. We use DVC for all of our model and data versioning, but what's been missing is the ability to cleanly integrate that into our CI workflow. Looks like that's solved now! The cml.yaml syntax also looks quite nice, very easy to follow. Looking forward to trying this out.
Great anecdote from Jeff Dean's twitter [1] about how they thought to try this experiment, given that doctors didn't know such a thing was possible:
"Funny story. We had a new team member joining, and @lhpeng suggested an orientation project for them of "Why don't you just try predicting age and gender from the images?" to get them familiar with our software setup, thinking age might be accurate within a couple of decades,and gender would be no better than chance. They went away and worked on this and came back with results that were much more accurate than expected, leading to more investigation about what else could be predicted."
Something to keep in mind the next time someone assures you that 'gaydar' is pseudoscience and totally impossible and any paper must be datamining - everything is correlated. We are routinely surprised what can be extracted from data, and anyone telling you a priori what cannot be done is pushing their politics.
This is not "AI that builds AI". The actual research behind AutoML is called NASNet (https://arxiv.org/pdf/1707.07012.pdf), and all it is simply: we found two good neural network layers (called NASNet normal cells / reduction cells in the paper) that work well on many different image datasets. It's a very cool research result. But it's not something that will replace AI researchers.
Yeah, I'm confused that this is the top comment; it's factually incorrect. NASNet is an example of a result of AutoML. To quote the Google blogpost on NASNet:
>In Learning Transferable Architectures for Scalable Image Recognition, we apply AutoML to the ImageNet image classification and COCO object detection dataset... AutoML was able to find the best layers that work well on CIFAR-10 but also work well on ImageNet classification and COCO object detection. These two layers are combined to form a novel architecture, which we called “NASNet”.
In contrast AutoML is, as the nytimes article describes, "a machine-learning algorithm that learns to build other machine-learning algorithms". More specifically, from the Google blogpost about AutoML:
>In our approach (which we call "AutoML"), a controller neural net can propose a “child” model architecture, which can then be trained and evaluated for quality on a particular task...Eventually the controller learns to assign high probability to areas of architecture space that achieve better accuracy on a held-out validation dataset, and low probability to areas of architecture space that score poorly.
Quoc, Barret, and others have been working on ANN-architecture-design systems for a while now (see: https://arxiv.org/abs/1611.01578), and AutoML specifically was done before announcing NASNet. Saying that NASNet is "the actual research behind AutoML" is drawing the causal arrow backwards.
It takes little imagination to see how methods used for designing neural networks can be applied to other parts of ML (e.g. optimization, feature selection, etc.). AutoML is definitely not a subset of NASNet.
Yeah, it's pretty sad the nytimes is writing such facile clickbait. I mean, don't these articles have to pass some kind of review? Makes you wonder about the other science-based articles they publish.
Some of the learning to learn models do much more than fine-tuning parameters, they can even discover novel architectures. On the other hand, using meta-learning can be a way to check if human generated solutions are up to par, because random search can be more thorough and even try absurd ideas that might work out.
In programming we have tons of automation as well, and we haven't ditched the programmer yet. Programming has been auto-cannibalizing itself since its inception, with each language automating more of our work. Even in ML, 10 years ago it was necessary to create features by hand, which required a lot of expertise. Today that's been automated by DL, yet we have more AI scientists than ever and the jobs are even better paid.
So I don't think meta-learning is a fluff idea, and we don't have to fear it replacing humans yet. Instead, it will make AI more robust. The only minus I see is that it requires a lot of compute, but we can rent that from the cloud (run an architecture search for a few thousand dollars); we don't need to fork over millions of dollars like the big labs that own their hardware. And we don't need this kind of intensive DL all the time, maybe just once per project. After we find the best architecture and hyperparameters, we can use them and train normally. By collating meta-learning data across many projects, we can make training faster and cheaper, reusing insight gained before.
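As a rough sketch of that "search once, then train normally" workflow (train_and_evaluate below is a placeholder for whatever training loop a project actually uses, not anything from a specific paper):

    import random

    search_space = {
        "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
        "num_layers": [2, 4, 8],
        "hidden_units": [64, 128, 256],
    }

    def train_and_evaluate(config):
        # Placeholder: stands in for the project's real training + validation run.
        return random.random()

    def sample_config():
        return {name: random.choice(choices) for name, choices in search_space.items()}

    # Spend the (rented) compute budget once on a random search...
    best_score, best_config = float("-inf"), None
    for _ in range(50):
        config = sample_config()
        score = train_and_evaluate(config)
        if score > best_score:
            best_score, best_config = score, config

    # ...then reuse best_config for normal, full training of the final model.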
Creating features has not been automated by Deep Learning at all. Even for image recognition tasks, where your "features" are simply the pixels of an image, there's still lots of preprocessing work to get those images into a form that NNs can deal with well.
Feature engineering is actually still the hardest part of most ML tasks, because it cannot be optimized by a simple grid search the way a model's hyperparameters can.
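To give a flavor of the preprocessing side, here is a rough sketch of the usual resize-and-normalize step before feeding images to a convnet (the size and normalization constants below are common ImageNet-style conventions, nothing specific to the post):

    import numpy as np
    from PIL import Image

    def preprocess(path, size=(224, 224)):
        # Load, force 3 channels, and resize to the network's expected input size.
        img = Image.open(path).convert("RGB").resize(size)
        x = np.asarray(img, dtype=np.float32) / 255.0   # scale pixels to [0, 1]
        # Standardize each channel with dataset statistics (ImageNet values here).
        mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
        std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
        return (x - mean) / std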
The AI Grant fellowship has been an awesome experience so far; I highly recommend applying if you're on the fence. The grant itself is useful (and this round includes $20k of Google cloud credits!) but equally valuable is the community that Nat has curated around the project. There are a lot of great folks working on AI from all sorts of places just a Slack message away. Now that Daniel Gross (YC's AI-focused partner) is involved as well, I'd expect that community to continue to grow.
This post doesn't even mention the easiest way to use deep learning without a lot of data: download a pretrained model and fine-tune the last few layers on your small dataset. In many domains (like image classification, the task in this blog post) fine-tuning works extremely well, because the pretrained model has learned generic features in the early layers that are useful for many datasets, not just the one trained on.
Even the best skin cancer classifier [1] was pretrained on ImageNet.
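Roughly, in Keras it looks something like this - a minimal sketch, where the ResNet50 backbone, the 5-class head, and the dataset names are placeholders rather than anything from the linked work:

    import tensorflow as tf
    from tensorflow.keras import layers

    # Pretrained ImageNet backbone with its classification head removed.
    base = tf.keras.applications.ResNet50(
        weights="imagenet", include_top=False, input_shape=(224, 224, 3))
    base.trainable = False   # freeze the generic early/mid layers

    # Small trainable head for the new (small) dataset.
    model = tf.keras.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dense(256, activation="relu"),
        layers.Dense(5, activation="softmax"),   # e.g. 5 classes in your data
    ])

    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(small_train_ds, validation_data=small_val_ds, epochs=5)
    # Optionally unfreeze the last few backbone layers afterwards and keep
    # training with a lower learning rate.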
This is how the great fast.ai course begins - download VGG16, replace the top with a single dense layer and fine-tune it, get amazing results. The second or third class shows how to make the top layers a bit more complex to get even better accuracy.
I'm skimming through the content and it seems really great! I'm interested in the last lesson (7-Exotic CNN Arch), but I'm afraid of missing other cool stuff in past lessons.
What do you suggest for someone who has experience with Deep Learning?
I think one of the strengths of the course is that Jeremy shows parts of the process of working on a ML problem. If you have time, I recommend watching earlier lessons, even if you know the theoretical aspects of the content covered.
Same!! On lesson 2 and already feel like I know so much! The top down approach to learning is great! Will actually read the book "Making Learning Whole" that inspired them to follow this approach.
Similarly for word embeddings like word2vec, GloVe, fastText, etc. in the case of NLP.
I think this is fundamental - if you teach a human how to recognize street signs, you don't need to show them millions of examples; just one or a few of each is enough, because we build on reference experiences of objects seen throughout life to encode the new images as memories.
My main doubt comes from the fact that language meaning may vary between different contexts, but I am no expert and am earnestly curious about using NLP and ML with not-that-big data.
Vector representations are useful for many natural language processing tasks.
In word embeddings like word2vec, GloVe, and fastText, the algorithm learns an associated n-dimensional vector (x1, x2, ..., xn) for each word.
A word may be close to another in one dimension (or a subspace) but far away in another. Moreover, a good representation allows meaningful vector space arithmetic: King - Man + Woman ≈ Queen.
Word representations are typically trained on very large unlabeled data, but once the algorithm learns the features you can use them for your small dataset.
EDIT: Add more explanations.
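A tiny example of that arithmetic, using pretrained GloVe vectors through gensim's downloader (a sketch, assuming the "glove-wiki-gigaword-100" model is available to download):

    import gensim.downloader as api

    # Pretrained on a huge unlabeled corpus; we just reuse the vectors.
    vectors = api.load("glove-wiki-gigaword-100")

    # king - man + woman should land near queen.
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

    # The same vectors can then be fed as features to a model trained on a
    # much smaller labeled dataset.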
This blog post[1] is a great example of positive unintended consequences of deep learning:
> We were very surprised that our model learned an interpretable feature, and that simply predicting the next character in Amazon reviews resulted in discovering the concept of sentiment.
> You don’t need Google-scale data to use deep learning. Using all of the above means that even your average person with only 100-1000 samples can see some benefit from deep learning. With all of these techniques you can mitigate the variance issue, while still benefitting from the flexibility. You can even build on others' work through things like transfer learning.
To be fair to the original article, his assertion was more along the lines of: you can't train a deep net without lots of data. As the second article shows, that isn't true in the general case. However, it is certainly true for creating any of the interesting models you think of when you think of deep nets (i.e., Inception, word2vec, etc.). You just can't get the richness of these without a lot of data to train them.
Transfer learning is efficient (minimal training time) and useful for most classification tasks across various domains.
Some of the first models I've used were built on Inception/ImageNet and I recall being thoroughly impressed by the performance.
This only works, though, if the pretraining data and your training data come from the same domain. Even then you will have issues if the data is from the same domain but differs in representation, e.g. 2D vs. 3D image datasets.
What would you consider the best resource for learning how to do this in Python? I have a smaller set of image data where I'd like to identify the components of (i.e. 'house', 'car', etc.).
Watch the very first one or two videos from http://fast.ai . They show how to do this in about seven lines of Python and five minutes' worth of model training.
"In contrast to deep neural networks which require great effort in hyper-parameter tuning, gcForest is much easier to train."
Hyperparameter tuning is not as much of an issue with deep neural networks anymore. Thanks to BatchNorm and more robust optimization algorithms, most of the time you can simply use Adam with a default learning rate of 0.001 and do pretty well. Dropout is not even necessary with many models that use BatchNorm nowadays, so generally tuning there is not an issue either. Many layers of 3x3 conv with stride 1 is still magical.
Basically: deep NNs can work pretty well with little to no tuning these days. The defaults just work.
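To make that concrete, here's the kind of "defaults" setup I mean - stacked 3x3 stride-1 convs with BatchNorm, compiled with Adam at its default learning rate of 0.001. This is just an illustrative sketch; the widths and depth are arbitrary, not any particular published architecture:

    import tensorflow as tf
    from tensorflow.keras import layers

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(32, 32, 3)),
        layers.Conv2D(32, 3, padding="same", use_bias=False),
        layers.BatchNormalization(),
        layers.ReLU(),
        layers.Conv2D(64, 3, padding="same", use_bias=False),
        layers.BatchNormalization(),
        layers.ReLU(),
        layers.GlobalAveragePooling2D(),
        layers.Dense(10, activation="softmax"),
    ])

    # "adam" uses the default learning rate of 0.001; no further tuning here.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])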
I couldn't disagree more. The defaults don't just work, and the architecture of the network could also be considered a hyperparameter, in which case what would be a reasonable default for all the types of problems ANNs are used for?
Are you using batch normalization? If you are, an issue I see all the time is folks not setting the EMA filter coefficient correctly. In Keras, it defaults to something like 0.99, which in my mind makes no sense. I use something around 0.6 and life is good. You want to get an overall good measurement of the statistics, and in my mind the frequency cutoff when coef=0.99 is just way too high for most applications. You usually want something that filters out just about everything except very close to DC.
The response to "the defaults should work just fine without any hyperparameter tuning" is "try fiddling with the EMA filter coefficient hyperparameter" ?
It's like the joke of the mathematician giving an exposition of a complex proof. At one point he says "It is obvious that X", pauses, scratches his head, does a few calculations. Leaves room for twenty minutes and returns. Then continues "it is obvious that X" and goes to the next step.
Deep in the field, it's fine for machine learning experts to say "everything just works" [if you've mastered X, Y, Q esoteric fields and tuning methods] since they're welcome to "humble brag" as much as they want. But when this gets in the way of figuring out what really "just works" it's more of a problem.
I think they're referring to the momentum parameter at [1]. The exponential moving average (EMA) of the batch mean/variance is used in the batch normalizing transform (Algorithm 1 in [2]).
The momentum ranges from 0 to 1. If it's close to 1, which the default of 0.99 is, the EMA of the batch mean/variance will change slowly across batches. If it's close to 0, the EMA will be close to the mean/variance of the current batch.
The EMA acts as a low-pass filter. With a momentum close to 1, the EMA changes slowly, filtering out high frequencies and leaving only frequencies close to DC. Note that this is opposite to what grandparent says: 0.99 has a lower frequency cutoff than 0.6 does. So I'm not really sure what they're getting at there.
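For concreteness, the running-statistics update being discussed is just an exponential moving average; here's a tiny numpy sketch (variable names are mine, not Keras internals):

    import numpy as np

    def update_running_mean(running_mean, batch_mean, momentum):
        # momentum near 1: the estimate changes slowly (heavy smoothing, low cutoff)
        # momentum near 0: the estimate tracks the current batch almost exactly
        return momentum * running_mean + (1.0 - momentum) * batch_mean

    running = 0.0
    for batch_mean in np.random.randn(100):   # pretend per-batch means
        running = update_running_mean(running, batch_mean, momentum=0.99)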
They work well, it's just that you need a lot of patience (and know-how) to work with them. Also, GPUs are expensive. By the time you realize that you messed up, you have wasted a lot of time. Of course this is true of any ML algorithm out there. But what I'm trying to say is that it's possible an as-yet-unknown method exists that may be less computationally complex.
One of the problems I see is that people abuse deep neural networks no end. One doesn't need to train a deep nn for recognizing structured objects like a coke can in a fridge. Simple HOG/SIFT/other feature engineering may be a faster and better bet for small-scale object recognition. However, expecting SIFT to outperform a deep neural net on ImageNet is out of the question. Thus, when it comes to deploying systems in a short frame of time, one should keep an open mind.
> One doesn't need to train a deep nn for recognizing structured objects like a coke can in a fridge.
I disagree. Sure, you don't need a NN to recognize one Coke can in one fridge for your toy robot project. If you want to recognize all Coke cans in all fridges, for your real-world, consumer-ready Coke-fetching robot product? You're going to need a huge dataset of all the various designs of Coke cans out there, in all the different kinds of refrigerators, and your toy feature engineered approach is going to lose to a NN on that kind of varied dataset.
I'm serious. If you have to rely on mono, single-image inputs then yeah, an ImageNet-trained net is going to do better. But it will also mistake every picture of a coke can for the real thing. It will be horrifically sensitive to malicious inputs. Much better would be to use 2 calibrated lenses and do 3D reconstruction. Even if you're just doing the reconstruction as a sanity check for a NN, to weed out the false positives.
Errm, hang on, are you saying that if you have a task of classifying unseen images given a labelled training set you should get a stereo camera or video camera and create another problem?
Which you can solve?
Because the problem is silly?
What if I say : "I will give you $10m to solve it, and if you fail, I will kill this very kind old monkey?"
Object recognition doesn't only exist in the subspace of labelled 2D images. It tends to be derived from a 3D space, which is a whole extra orthogonal data source that the "NN all the things" crowd is fastidiously ignoring.
Why, I'm not sure, but I'm guessing because it is hard/inaccurate to do with just NNs and parameter/network architecture tweaking. Possibly also because benchmarks with single mono images are much easier to make.
Just because it is hard with method A, and harder to make benchmarks for, doesn't mean method B isn't better.
Yes, but am I missing something when I say that if the problem is to deal with labelled 2D images, declaring that you should be working with 3D images or short video sequences doesn't help?
Sure, if you are building a Robot and I say "use this camera and a deep network" and you say "It'll work better with stereo" well... yes super do that!
But if we are working with mono images I don't understand how the observation helps?
> If you want to recognize all Coke cans in all fridges, for your real-world, consumer-ready Coke-fetching robot product?
If you're stuck with a mono dataset, post collection, then sure use NN and call it a day. But even if you have video you can do 3D reconstruction just from baseline movement. You won't know scale, so you can't differentiate between big coke cans and little coke cans, but at least you can rule out pictures of coke cans.
> But an NN can completely mess up when a new refrigerator is used that wasn't part of the training set
Not if your training set is representative. And this is just as true of feature engineered approaches, the only difference is that dealing with real world variation requires a lot less work with NNs because once you add the variation to your dataset you're done. With feature engineering that's only the first step because now you have to figure out where the new variation is breaking your features and how to modify them to fix it.
And herein lies a prominent failure mode of a huge amount of this sort of work that I've seen - hard to just "add the variation to your dataset" when your data set is one or more orders of magnitude too small to contain it. At that point all that remains is the handwaving.
The right response to insufficient data is usually simplifying the modeling.
I'm not sure about that. The new GAN models over the past 2-3 months, like LS-GAN or WGAN, all seem to train much more stably. I've beaten up on WGAN with all sorts of strange tweaks and hyperparameter settings and while it may not work well, it's never catastrophically diverged on me the way DCGAN would at the drop of a hat.
- Labeled data is very expensive. Historically, attempts to learn on synthetic data have failed because ConvNets are very good at detecting small visual artifacts in the synthetic data and using those for classification during training. At test time on real data, those artifacts aren't present, so the model fails. A technique that can beat state-of-the-art (admittedly on a very narrow eye-gaze dataset, but still) by only training on labels from synthetic data and testing on real data is important.
- They present a useful new idea to improve GAN training: using a history of "fake" images, rather than only the latest fake images from the generator. Ask anyone who has tried to train a GAN: the training is really unstable, and each network only cares about beating the latest version of its "opponent". They show good improvements by saving many previous fake outputs to make the generator more robust (a rough sketch of such a history buffer is below, after these bullets). This reminds me of Experience Replay from DeepMind for RL.
- It's a published paper from Apple! Great that they are starting to contribute back to the research community.
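Here's a rough sketch of that image-history idea - my own illustration, not Apple's code: each discriminator batch is half fresh fakes, half older ones sampled from a fixed-size pool:

    import random
    import numpy as np

    class ImageHistoryBuffer:
        def __init__(self, max_size=500):
            self.max_size = max_size
            self.pool = []

        def mix_batch(self, new_fakes):
            """Return a discriminator batch mixing current fakes with older ones,
            and stash some of the current fakes for future batches."""
            half = len(new_fakes) // 2
            if len(self.pool) >= half:
                old = np.stack(random.sample(self.pool, half))
                batch = np.concatenate([new_fakes[:half], old])
            else:
                batch = new_fakes   # pool not warm yet; use fresh fakes only
            for img in new_fakes[:half]:
                if len(self.pool) >= self.max_size:
                    # evict a random old entry to make room
                    self.pool[random.randrange(self.max_size)] = img
                else:
                    self.pool.append(img)
            return batch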
This paper builds off of DeepMind's previous work on differentiable computation: Neural Turing Machines. That paper generated a lot of enthusiasm when it came out in 2014, but not many researchers use NTMs today.
The feeling among researchers I've spoken to is not that NTMs aren't useful. DeepMind is simply operating on another level. Other researchers don't understand the intuitions behind the architecture well enough to make progress with it. But it seems like DeepMind, and specifically Alex Graves (first author on NTMs and now this), can.
The reason other researchers haven't jumped on NTMs may be that, unlike commonly-researched types of neural nets such as CNNs or RNNs, NTMs are not currently the best way to solve any real-world problem. The problems they have solved so far are relatively trivial, and they are very inefficient, inaccurate, and complex relative to traditional CS methods (e.g. Dijkstra's algorithm coded in C).
That's not to say that NTMs are bad or uninteresting! They are super cool and I think have huge potential in natural language understanding, reasoning, and planning. However, I do think that DeepMind will have to prove that they can be used to solve some non-trivial task, one that can't be solved much more efficiently with traditional CS methods, before people will join in to their research.
Also, I think there's a possibility that solving non-trivial problems with NTMs may require more computing power than Moore's law has given us so far. In the same way that NNs didn't really take off until GPU implementations became available, we may have to wait for the next big hardware breakthrough for NTMs to come into their own.
The brain is not a single universal neural network that does everything well. It's a collection of different neural networks that specialize in different tasks, and probably use very different methods to achieve them.
It seems like the way forward would be networking together various kinds of neural networks to achieve complex goals. For example, an NTM specialized in formulating plans that has access to a CNN for image recognition, and so on.
Actually, in the architecture you described, if there is a planning net that's connected to an image net and an audio net, then rather than feeding audio to the image net, I think synesthesia would be better modeled by feeding the output of the audio net into the image net's input on the planning net. If that makes sense.
It's how some guys defeated the first iteration of reCAPTCHA's audio mode. Then Google replaced it with something very annoying to use, even for humans.
They sure put a lot of focus on "toy" problems such as sorting and path planning in their papers - perhaps because they are easy to understand and show a major improvement over other ML approaches. IMHO they should focus more on "real" problems - e.g. in Table 1 of this paper it seems to be state of the art on the bAbI tasks, which is amazing.
At least some of the "toy" problems aren't chosen just for being easy to solve or understand. They're chosen for being qualitatively different than the kinds of problems other neural nets are capable of solving. Sorting, for example, is not something you can accomplish in practice with an LSTM.
Mainstream work on neural nets is focused on pattern recognition and generation of various forms. I don't mean to trivialize at all when I say this - this gives us a new way to solve problems with computers. It allows us to go beyond the paradigm of hand-built algorithms over bytes in memory.
What DeepMind is exploring with this line of research is whether neural nets can even subsume this older paradigm. Can they learn to induce the kinds of algorithms we're used to writing in our text editors? Given this goal, I think it's better to call problems like sorting "elementary" rather than "toy".
bAbI isn't really a "real" problem either, although somewhat better than sorting and the like. bAbI works with extremely restrictive worlds and grammar. In contrast, current speech recognition, language modeling, and object detection do quite well with actual audio, text, and pictures.
I think the strength of NTMs will be best demonstrated by putting it to work on a long-range language modeling task where you need to organize what you read so that you can use it to predict better a paragraph or two later. Current language models based on LSTM are not really able to do this.
Once you have a learning machine that can solve simple problems, you can scale it up to solve very complex problems. It's a first step to true AI, IMHO. A lot of small steps are needed to go towards this goal. Integrating memory and neural nets is a big step, IMHO.
> Once you have a learning machine that can solve simple problems, you can scale it up to solve very complex problems.
Nope. It's really easy to solve simple problems; it can sometimes even be done by brute-force.
That's what caused the initial optimism around AI, e.g. the 1950s notion that it would be an interesting summer project for a grad student.
Insights into computational complexity during the 1960s showed that scaling is actually the difficult part. After all, if brute-force were scalable then there'd be no reason to write any other software (even if a more efficient program were required, the brute-forcer could write it for us).
That's why the rapid progress on simple problems, e.g. using Eliza, SHRDLU, General Problem Solver, etc. hasn't been sustained, and why we can't just run those systems on a modern cluster and expect them to tackle realistic problems.
DeepMind is breaking new ground in a number of directions. For example, "Decoupled Neural Interfaces using Synthetic Gradients" is simply amazing - they can make training a net asynchronous and run individual layers on separate machines by approximating the gradients with a local net. It's the kind of thing that sounds crazy on paper, but they proved it works.
Another amazing thing they did was to generate audio by direct synthesis from a neural net, beating all previous benchmarks. If they can make it work in real time, it would be a huge upgrade in our TTS technology.
We're still waiting for the new and improved AlphaGo. I hope they don't bury that project.
I'm not super knowledgeable about the space, but would the audio generation you mentioned be what is needed to let their Assistant communicate verbally in any language, any voice, add inflections, emotion, etc. without needing to pre-record all the chunks/combinations?
They're taking features that are present in the brain that aren't modeled and are making computational models for them. They're not a gold standard. You can create your own in under an hour. It's not another level. It's bio-inspired computing.
Here.. take the 'Axon Hillock'
https://en.wikipedia.org/wiki/Axon_hillock
code up a function for it, attach it to present day neuron models, make it do something fancy, write a white-paper and kazaam you're operating on another level..
In the human brain, neurons store an incredible amount of information. Neuron models in neural networks only store information in their weights.
There is still a lack of understanding of how the human brain does it. DeepMind grabbed a proven memory model from Alan Turing's work and applied it to the feature-barren neuron models in use.
Sprinkle magic ...
They are not operating on another level; they're bringing over features that are well documented in the human brain and in white papers from a past period when people actually thought deeply about this problem, and applying them.
There is no 'intuition' about the architecture. Study the human brain and copy pasta into the computing realm.
Others are doing this as well. If anyone bothered to read the white papers people publish, you'll see that many people have presented similar ideas over the years.
You can come up with your own neural Turing machine. Take a featureless neuron model, slap a memory module on it, and you have a neural Turing machine.
In order to use a Turing machine in a neural network - or at least to train it, in any way that isn't impractical and/or cheating - you need to make it differentiable somehow.
Graves and co. have been really creative in overcoming problems in their ongoing program to differentiate ALL the things.
I think the easiest way to see this is with an example of a non-differentiable architecture.
Let's suppose on the current training input, the network produces some output that is a little wrong. It produced this output by reading a value v at location x of memory.
In other words, output = v = mem[x]
It could be wrong because the value in memory should have been something else. In this case, you can propagate the gradient backwards. Whatever the error was at the output, is also the error at this memory location.
Or it could be wrong because it read from the wrong memory location. Now you're a bit dead in the water. You have some memory address x, and you want to take the derivative of v with respect to x. But x is this sort of thing that jumps discretely (just as an integer memory address does). You can't wiggle x to see what effect it has on v, which means that you don't know which direction x should move in in order to reduce the error.
So (at least in the 2014 paper, ignoring the content-addressed memory), memory accesses don't look like v = mem[x]. They look like v = sum_i(a_i * mem[i]). Any time you read from memory, you're actually reading all the memory, and taking a weighted sum of the memory values. And now you can take derivatives with respect to that weighting.
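Here's a tiny numpy sketch of the difference (the memory contents and addressing scores are made up for illustration):

    import numpy as np

    mem = np.array([[1.0, 0.0],
                    [0.0, 5.0],
                    [2.0, 2.0]])   # 3 memory slots, 2 values per slot

    # Hard read: v = mem[x]. Not differentiable with respect to the integer x.
    x = 1
    v_hard = mem[x]

    # Soft read: the controller emits a weighting a over *all* slots
    # (e.g. a softmax over addressing scores), and v = sum_i a_i * mem[i].
    scores = np.array([0.1, 3.0, 0.5])           # hypothetical addressing scores
    a = np.exp(scores) / np.exp(scores).sum()    # softmax: weights sum to 1
    v_soft = a @ mem                             # weighted sum over all of memory

    # v_soft depends smoothly on the scores, so the output error can be
    # backpropagated into the addressing mechanism.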
To me, the question this raises is, what right do we have to call this a Turing machine. This is a very strong departure from Turing machines and digital computers.
Turing didn't specify how reads and writes happened on the tape. For the argument he was making it was clearer to assume there was no noise in the system.
As for "digital" computers remember they are built out of noisy physical systems. Any bit in the CPU is actually a range of voltages that we squash into the abstract concept of binary.
I don't think that is really relevant to the discussion. Regardless of how a digital computer is physically implemented, we use it according to specification. We concretize the concept of binary by designing the machine to withstand noise. What we get when we choose the digital abstraction is that it is actually realistic: digital computers pretty much operate digitally. Corruption happens, but we consider that an error, and we try to design so that a programmer writing all but the most critical of applications can assume that memory does not get corrupted.
We don't squash the range of voltages. The digital component that interprets that voltage does the squashing. And we design it that way purposefully. https://en.wikipedia.org/wiki/Static_discipline
Turing specified that the reads and the writes are done by heads, which touch a single tape position. You can have multiple (finitely many) tapes and heads, without leaving the class of "Turing machine". But nothing like blending symbols from adjacent locations on the tape, or requiring non-local access to the tape.
No wonder Google built (is building) custom accelerators in hardware. This points to a completely different architecture from Von Neumann, or at least it points to MLPUs, Machine Learning Processing Units.
Pardon my ignorance as I'm not super knowledgeable on this, but is what you described around reading all the memory and taking the weighted sum of values similar in a sense to creating a checksum to compare something against?
I suppose I can see the similarity, in that there's some accumulated value (the sum) from reading some segment of memory, but otherwise I don't think the comparison is helpful.
It means it can be trained by backpropagating the error gradient through the network.
To train a neural network, you want to know how much each component contributed to an error. We do that by propagating the error through each component in reverse, using the partial derivatives of the corresponding function.
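A toy worked example of that bookkeeping (one hidden unit, made-up numbers): the error is pushed backwards through each function by multiplying its partial derivative.

    # "Network": y = w2 * relu(w1 * x), squared-error loss L = (y - t)^2
    x, t = 2.0, 1.0
    w1, w2 = 0.5, -1.0

    # Forward pass
    h = max(w1 * x, 0.0)
    y = w2 * h
    L = (y - t) ** 2

    # Backward pass: propagate dL through each component in reverse
    dL_dy = 2.0 * (y - t)
    dL_dw2 = dL_dy * h                                 # contribution of w2
    dL_dh = dL_dy * w2
    dL_dw1 = dL_dh * (1.0 if w1 * x > 0 else 0.0) * x  # contribution of w1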
Don't forget you first need to understand the mathematical theory of how a brain does computation and pattern recognition. Of course they look into how a brain does it. But the mathematical underpinnings, and how the information flows, are much more important than how an individual neuron works in real life. Abstraction, and applying it to real data, is what they are doing.
Computing power means much larger variable spaces can be handled in optimisation problems. NNs are a means to prune the variable space during optimisation in a domain-unspecific way.
Highly recommend this video on the Deep Visualization Toolbox to anyone interested in understanding more about how convnets work through visualization: