This is the correct answer. A key reason REBCO is so much better than older alternatives is that it can be cooled to superconductivity using only liquid nitrogen (77 K), as opposed to liquid helium, which is much harder to work with.
Nit: ultimately the new magnets need to be stronger for the same volume, no matter how easy the coolant is to work with. The simpler cooling is just a nice bonus.
Longtime DVC user here - this is going to be so helpful. We use DVC for all of our model and data versioning, but what's been missing is the ability to cleanly integrate that into our CI workflow. Looks like that's solved now! The cml.yaml syntax also looks quite nice, very easy to follow. Looking forward to trying this out.
Great anecdote from Jeff Dean's twitter [1] about how they thought to try this experiment, given that doctors didn't know such a thing was possible:
"Funny story. We had a new team member joining, and @lhpeng suggested an orientation project for them of "Why don't you just try predicting age and gender from the images?" to get them familiar with our software setup, thinking age might be accurate within a couple of decades,and gender would be no better than chance. They went away and worked on this and came back with results that were much more accurate than expected, leading to more investigation about what else could be predicted."
Something to keep in mind the next time someone assures you that 'gaydar' is pseudoscience and totally impossible and any paper must be datamining - everything is correlated. We are routinely surprised what can be extracted from data, and anyone telling you a priori what cannot be done is pushing their politics.
This is not "AI that builds AI". The actual research behind AutoML is called NASNet (https://arxiv.org/pdf/1707.07012.pdf), and all it is simply: we found two good neural network layers (called NASNet normal cells / reduction cells in the paper) that work well on many different image datasets. It's a very cool research result. But it's not something that will replace AI researchers.
Yeah, I'm confused that this is the top comment; it's factually incorrect. NASNet is an example of a result of AutoML. To quote the Google blogpost on NASNet:
>In Learning Transferable Architectures for Scalable Image Recognition, we apply AutoML to the ImageNet image classification and COCO object detection dataset... AutoML was able to find the best layers that work well on CIFAR-10 but also work well on ImageNet classification and COCO object detection. These two layers are combined to form a novel architecture, which we called “NASNet”.
In contrast AutoML is, as the nytimes article describes, "a machine-learning algorithm that learns to build other machine-learning algorithms". More specifically, from the Google blogpost about AutoML:
>In our approach (which we call "AutoML"), a controller neural net can propose a “child” model architecture, which can then be trained and evaluated for quality on a particular task...Eventually the controller learns to assign high probability to areas of architecture space that achieve better accuracy on a held-out validation dataset, and low probability to areas of architecture space that score poorly.
Quoc, Barret, and others have been working on ANN-architecture-design systems for a while now (see: https://arxiv.org/abs/1611.01578), and AutoML specifically was done before announcing NASNet. Saying that NASNet is "the actual research behind AutoML" is drawing the causal arrow backwards.
It takes little imagination to see how methods used for designing neural networks can be applied to other parts of ML (e.g. optimization, feature selection, etc.). AutoML is definitely not a subset of NASNet.
Yeah, it's pretty sad the nytimes is writing such facile clickbait. I mean, don't these articles have to pass some kind of review? Makes you wonder about the other science-based articles they publish.
Some of the learning to learn models do much more than fine-tuning parameters, they can even discover novel architectures. On the other hand, using meta-learning can be a way to check if human generated solutions are up to par, because random search can be more thorough and even try absurd ideas that might work out.
In programming we have tons of automation as well, and we haven't ditched the programmer yet. Programming has been auto-cannibalizing itself since its inception, with each language automating more of our work. Even in ML, 10 years ago it was necessary to create features by hand, which required a lot of expertise. Today that's been automated by DL, yet we have more AI scientists than ever and the jobs are even better paid.
So I don't think meta-learning is a fluff idea, and we don't have to fear it replacing humans yet. Instead, it will make AI more robust. The only minus I see is that it requires a lot of compute, but we can rent that from the cloud (run an architecture search for a few thousand dollars); we don't need to fork over millions of dollars like the big labs that own their hardware. And we don't need this kind of intensive DL all the time, maybe just once per project. After we find the best architecture and hyperparameters, we can use them and train normally. By collating meta-learning data across many projects, we can make training faster and cheaper, reusing insight gained before.
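As a rough sketch of that "search once, then train normally" workflow (train_and_evaluate below is a placeholder for whatever training loop a project actually uses, not anything from a specific paper):

    import random

    search_space = {
        "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
        "num_layers": [2, 4, 8],
        "hidden_units": [64, 128, 256],
    }

    def train_and_evaluate(config):
        # Placeholder: stands in for the project's real training + validation run.
        return random.random()

    def sample_config():
        return {name: random.choice(choices) for name, choices in search_space.items()}

    # Spend the (rented) compute budget once on a random search...
    best_score, best_config = float("-inf"), None
    for _ in range(50):
        config = sample_config()
        score = train_and_evaluate(config)
        if score > best_score:
            best_score, best_config = score, config

    # ...then reuse best_config for normal, full training of the final model.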
Creating features has not been automated by Deep Learning at all. Even for image recognition tasks, where your "features" are simply the pixels of an image, there's still lots of preprocessing work to get those images into a form that NNs can deal with well.
Feature engineering is actually still the hardest part of most ML tasks, because it cannot be optimized by a simple grid search the way a model's hyperparameters can.
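To give a flavor of the preprocessing side, here is a rough sketch of the usual resize-and-normalize step before feeding images to a convnet (the size and normalization constants below are common ImageNet-style conventions, nothing specific to the post):

    import numpy as np
    from PIL import Image

    def preprocess(path, size=(224, 224)):
        # Load, force 3 channels, and resize to the network's expected input size.
        img = Image.open(path).convert("RGB").resize(size)
        x = np.asarray(img, dtype=np.float32) / 255.0   # scale pixels to [0, 1]
        # Standardize each channel with dataset statistics (ImageNet values here).
        mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
        std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
        return (x - mean) / std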
The AI Grant fellowship has been an awesome experience so far; I highly recommend applying if you're on the fence. The grant itself is useful (and this round includes $20k of Google cloud credits!) but equally valuable is the community that Nat has curated around the project. There are a lot of great folks working on AI from all sorts of places just a Slack message away. Now that Daniel Gross (YC's AI-focused partner) is involved as well, I'd expect that community to continue to grow.
This post doesn't even mention the easiest way to use deep learning without a lot of data: download a pretrained model and fine-tune the last few layers on your small dataset. In many domains (like image classification, the task in this blog post) fine-tuning works extremely well, because the pretrained model has learned generic features in the early layers that are useful for many datasets, not just the one trained on.
Even the best skin cancer classifier [1] was pretrained on ImageNet.
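Roughly, in Keras it looks something like this - a minimal sketch, where the ResNet50 backbone, the 5-class head, and the dataset names are placeholders rather than anything from the linked work:

    import tensorflow as tf
    from tensorflow.keras import layers

    # Pretrained ImageNet backbone with its classification head removed.
    base = tf.keras.applications.ResNet50(
        weights="imagenet", include_top=False, input_shape=(224, 224, 3))
    base.trainable = False   # freeze the generic early/mid layers

    # Small trainable head for the new (small) dataset.
    model = tf.keras.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dense(256, activation="relu"),
        layers.Dense(5, activation="softmax"),   # e.g. 5 classes in your data
    ])

    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(small_train_ds, validation_data=small_val_ds, epochs=5)
    # Optionally unfreeze the last few backbone layers afterwards and keep
    # training with a lower learning rate.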
This is how the great fast.ai course begins - download VGG16, replace the top with a single dense layer and fine-tune it, get amazing results. The second or third class shows how to make the top layers a bit more complex to get even better accuracy.
I'm skimming through the content and it seems really great! I'm interested in the last lesson (7-Exotic CNN Arch), but I'm afraid of missing other cool stuff in past lessons.
What do you suggest for someone who has experience with Deep Learning?
I think one of the strengths of the course is that Jeremy shows parts of the process of working on a ML problem. If you have time, I recommend watching earlier lessons, even if you know the theoretical aspects of the content covered.
Same!! On lesson 2 and already feel like I know so much! The top down approach to learning is great! Will actually read the book "Making Learning Whole" that inspired them to follow this approach.
Similarly for word embeddings like word2vec, GloVe, fastText, etc. in the case of NLP.
I think this is fundamental - if you teach a human how to recognize street signs, you don't need to show them millions of examples; just one or a few of each is enough, because we build on reference experiences of objects seen throughout life to encode the new images as memories.
My main doubt comes from the fact that language meaning may vary between different contexts, but I am no expert and am earnestly curious about using NLP and ML with not-that-big data.
Vector representations are useful for many natural language processing tasks.
In word embeddings like word2vec, GloVe, and fastText, the algorithm learns an associated n-dimensional vector (x1, x2, ..., xn) for each word.
A word may be close to another in one dimension (or a subspace) but far away in another. Moreover, a good representation allows meaningful vector space arithmetic: King - Man + Woman ≈ Queen.
Word representations are typically trained on very large unlabeled data, but once the algorithm learns the features you can use them for your small dataset.
EDIT: Add more explanations.
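A tiny example of that arithmetic, using pretrained GloVe vectors through gensim's downloader (a sketch, assuming the "glove-wiki-gigaword-100" model is available to download):

    import gensim.downloader as api

    # Pretrained on a huge unlabeled corpus; we just reuse the vectors.
    vectors = api.load("glove-wiki-gigaword-100")

    # king - man + woman should land near queen.
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

    # The same vectors can then be fed as features to a model trained on a
    # much smaller labeled dataset.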
This blog post[1] is a great example of positive unintended consequences of deep learning:
> We were very surprised that our model learned an interpretable feature, and that simply predicting the next character in Amazon reviews resulted in discovering the concept of sentiment.
> You don’t need Google-scale data to use deep learning. Using all of the above means that even your average person with only 100-1000 samples can see some benefit from deep learning. With all of these techniques you can mitigate the variance issue, while still benefitting from the flexibility. You can even build on others' work through things like transfer learning.
To be fair to the original article, his assertion was more along the lines of: you can't train a deep net without lots of data. As the second article shows, that isn't true in the general case. However, it is certainly true for creating any of the interesting models you think of when you think of deep nets (i.e., Inception, word2vec, etc.). You just can't get the richness of these without a lot of data to train them.
Transfer learning is efficient (minimal training time) and useful for most classification tasks across various domains.
Some of the first models I've used were built on Inception/ImageNet and I recall being thoroughly impressed by the performance.
This only works, though, if the pretraining data and your training data come from the same domain. Even then you will have issues if the data is from the same domain but differs in representation, e.g. 2D vs. 3D image datasets.
What would you consider the best resource for learning how to do this in Python? I have a smaller set of image data where I'd like to identify the components of (i.e. 'house', 'car', etc.).
Watch the very first one or two videos from http://fast.ai . They show how to do this in about seven lines of Python and five minutes' worth of model training.
"In contrast to deep neural networks which require great effort in hyper-parameter tuning, gcForest is much easier to train."
Hyperparameter tuning is not as much of an issue with deep neural networks anymore. Thanks to BatchNorm and more robust optimization algorithms, most of the time you can simply use Adam with a default learning rate of 0.001 and do pretty well. Dropout is not even necessary with many models that use BatchNorm nowadays, so generally tuning there is not an issue either. Many layers of 3x3 conv with stride 1 is still magical.
Basically: deep NNs can work pretty well with little to no tuning these days. The defaults just work.
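To make that concrete, here's the kind of "defaults" setup I mean - stacked 3x3 stride-1 convs with BatchNorm, compiled with Adam at its default learning rate of 0.001. This is just an illustrative sketch; the widths and depth are arbitrary, not any particular published architecture:

    import tensorflow as tf
    from tensorflow.keras import layers

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(32, 32, 3)),
        layers.Conv2D(32, 3, padding="same", use_bias=False),
        layers.BatchNormalization(),
        layers.ReLU(),
        layers.Conv2D(64, 3, padding="same", use_bias=False),
        layers.BatchNormalization(),
        layers.ReLU(),
        layers.GlobalAveragePooling2D(),
        layers.Dense(10, activation="softmax"),
    ])

    # "adam" uses the default learning rate of 0.001; no further tuning here.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])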
I couldn't disagree more. The defaults don't just work, and the architecture of the network could also be considered a hyperparameter, in which case what would be a reasonable default for all the types of problems ANNs are used for?
Are you using batch normalization? If you are, an issue I see all the time is folks not setting the EMA filter coefficient correctly. In Keras, it defaults to something like 0.99, which in my mind makes no sense. I use something around 0.6 and life is good. You want to get an overall good measurement of the statistics, and in my mind the frequency cutoff when coef=0.99 is just way too high for most applications. You usually want something that filters out just about everything except very close to DC.
The response to "the defaults should work just fine without any hyperparameter tuning" is "try fiddling with the EMA filter coefficient hyperparameter" ?
It's like the joke of the mathematician giving an exposition of a complex proof. At one point he says "It is obvious that X", pauses, scratches his head, does a few calculations. Leaves room for twenty minutes and returns. Then continues "it is obvious that X" and goes to the next step.
Deep in the field, it's fine for machine learning experts to say "everything just works" [if you've mastered X, Y, Q esoteric fields and tuning methods] since they're welcome to "humble brag" as much as they want. But when this gets in the way of figuring out what really "just works" it's more of a problem.
I think they're referring to the momentum parameter at [1]. The exponential moving average (EMA) of the batch mean/variance is used in the batch normalizing transform (Algorithm 1 in [2]).
The momentum ranges from 0 to 1. If it's close to 1, which the default of 0.99 is, the EMA of the batch mean/variance will change slowly across batches. If it's close to 0, the EMA will be close to the mean/variance of the current batch.
The EMA acts as a low-pass filter. With a momentum close to 1, the EMA changes slowly, filtering out high frequencies and leaving only frequencies close to DC. Note that this is opposite to what grandparent says: 0.99 has a lower frequency cutoff than 0.6 does. So I'm not really sure what they're getting at there.
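For concreteness, the running-statistics update being discussed is just an exponential moving average; here's a tiny numpy sketch (variable names are mine, not Keras internals):

    import numpy as np

    def update_running_mean(running_mean, batch_mean, momentum):
        # momentum near 1: the estimate changes slowly (heavy smoothing, low cutoff)
        # momentum near 0: the estimate tracks the current batch almost exactly
        return momentum * running_mean + (1.0 - momentum) * batch_mean

    running = 0.0
    for batch_mean in np.random.randn(100):   # pretend per-batch means
        running = update_running_mean(running, batch_mean, momentum=0.99)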
They work well, it's just that you need a lot of patience (and know-how) to work with them. Also, GPUs are expensive. By the time you realize that you messed up, you have wasted a lot of time. Of course this is true of any ML algorithm out there. But what I'm trying to say is that it's possible an as-yet-unknown method exists that may be less computationally complex.
One of the problems I see is that people abuse deep neural networks no end. One doesn't need to train a deep nn for recognizing structured objects like a coke can in a fridge. Simple HOG/SIFT/other feature engineering may be a faster and better bet for small-scale object recognition. However, expecting SIFT to outperform a deep neural net on ImageNet is out of the question. Thus, when it comes to deploying systems in a short frame of time, one should keep an open mind.
> One doesn't need to train a deep nn for recognizing structured objects like a coke can in a fridge.
I disagree. Sure, you don't need a NN to recognize one Coke can in one fridge for your toy robot project. If you want to recognize all Coke cans in all fridges, for your real-world, consumer-ready Coke-fetching robot product? You're going to need a huge dataset of all the various designs of Coke cans out there, in all the different kinds of refrigerators, and your toy feature engineered approach is going to lose to a NN on that kind of varied dataset.
I'm serious. If you have to rely on mono, single-image inputs then yeah, an ImageNet-trained net is going to do better. But it will also mistake every picture of a coke can for the real thing. It will be horrifically sensitive to malicious inputs. Much better would be to use 2 calibrated lenses and do 3D reconstruction. Even if you're just doing the reconstruction as a sanity check for a NN, to weed out the false positives.
Errm, hang on, are you saying that if you have a task of classifying unseen images given a labelled training set you should get a stereo camera or video camera and create another problem?
Which you can solve?
Because the problem is silly?
What if I say : "I will give you $10m to solve it, and if you fail, I will kill this very kind old monkey?"
Object recognition doesn't only exist in the subspace of labelled 2D images. It tends to be derived from a 3D space, which is a whole extra orthogonal data source that the "NN all the things" crowd is fastidiously ignoring.
Why, I'm not sure, but I'm guessing because it is hard/inaccurate to do with just NNs and parameter/network architecture tweaking. Possibly also because benchmarks with single mono images are much easier to make.
Just because it is hard with method A, and harder to make benchmarks for, doesn't mean method B isn't better.
Yes, but am I missing something when I say that if the problem is to deal with labelled 2D images, declaring that you should be working with 3D images or short video sequences doesn't help?
Sure, if you are building a Robot and I say "use this camera and a deep network" and you say "It'll work better with stereo" well... yes super do that!
But if we are working with mono images I don't understand how the observation helps?
> If you want to recognize all Coke cans in all fridges, for your real-world, consumer-ready Coke-fetching robot product?
If you're stuck with a mono dataset, post collection, then sure use NN and call it a day. But even if you have video you can do 3D reconstruction just from baseline movement. You won't know scale, so you can't differentiate between big coke cans and little coke cans, but at least you can rule out pictures of coke cans.
> But an NN can completely mess up when a new refrigerator is used that wasn't part of the training set
Not if your training set is representative. And this is just as true of feature engineered approaches, the only difference is that dealing with real world variation requires a lot less work with NNs because once you add the variation to your dataset you're done. With feature engineering that's only the first step because now you have to figure out where the new variation is breaking your features and how to modify them to fix it.
And herein lies a prominent failure mode of a huge amount of this sort of work that I've seen - hard to just "add the variation to your dataset" when your data set is one or more orders of magnitude too small to contain it. At that point all that remains is the handwaving.
The right response to insufficient data is usually simplifying the modeling.
I'm not sure about that. The new GAN models over the past 2-3 months, like LS-GAN or WGAN, all seem to train much more stably. I've beaten up on WGAN with all sorts of strange tweaks and hyperparameter settings and while it may not work well, it's never catastrophically diverged on me the way DCGAN would at the drop of a hat.
- Labeled data is very expensive. Historically, attempts to learn on synthetic data have failed because ConvNets are very good at detecting small visual artifacts in the synthetic data and using those for classification during training. At test time on real data, those artifacts aren't present, so the model fails. A technique that can beat state-of-the-art (admittedly on a very narrow eye-gaze dataset, but still) by only training on labels from synthetic data and testing on real data is important.
- They present a useful new idea to improve GAN training: using a history of "fake" images, rather than only the latest fake images from the generator. Ask anyone who has tried to train a GAN: the training is really unstable, and each network only cares about beating the latest version of its "opponent". They show good improvements by saving many previous fake outputs to make the generator more robust (a rough sketch of such a history buffer is below, after these bullets). This reminds me of Experience Replay from DeepMind for RL.
- It's a published paper from Apple! Great that they are starting to contribute back to the research community.
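Here's a rough sketch of that image-history idea - my own illustration, not Apple's code: each discriminator batch is half fresh fakes, half older ones sampled from a fixed-size pool:

    import random
    import numpy as np

    class ImageHistoryBuffer:
        def __init__(self, max_size=500):
            self.max_size = max_size
            self.pool = []

        def mix_batch(self, new_fakes):
            """Return a discriminator batch mixing current fakes with older ones,
            and stash some of the current fakes for future batches."""
            half = len(new_fakes) // 2
            if len(self.pool) >= half:
                old = np.stack(random.sample(self.pool, half))
                batch = np.concatenate([new_fakes[:half], old])
            else:
                batch = new_fakes   # pool not warm yet; use fresh fakes only
            for img in new_fakes[:half]:
                if len(self.pool) >= self.max_size:
                    # evict a random old entry to make room
                    self.pool[random.randrange(self.max_size)] = img
                else:
                    self.pool.append(img)
            return batch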
This paper builds off of DeepMind's previous work on differentiable computation: Neural Turing Machines. That paper generated a lot of enthusiasm when it came out in 2014, but not many researchers use NTMs today.
The feeling among researchers I've spoken to is not that NTMs aren't useful. DeepMind is simply operating on another level. Other researchers don't understand the intuitions behind the architecture well enough to make progress with it. But it seems like DeepMind, and specifically Alex Graves (first author on NTMs and now this), can.
The reason other researchers haven't jumped on NTMs may be that, unlike commonly-researched types of neural nets such as CNNs or RNNs, NTMs are not currently the best way to solve any real-world problem. The problems they have solved so far are relatively trivial, and they are very inefficient, inaccurate, and complex relative to traditional CS methods (e.g. Dijkstra's algorithm coded in C).
That's not to say that NTMs are bad or uninteresting! They are super cool and I think have huge potential in natural language understanding, reasoning, and planning. However, I do think that DeepMind will have to prove that they can be used to solve some non-trivial task, one that can't be solved much more efficiently with traditional CS methods, before people will join in to their research.
Also, I think there's a possibility that solving non-trivial problems with NTMs may require more computing power than Moore's law has given us so far. In the same way that NNs didn't really take off until GPU implementations became available, we may have to wait for the next big hardware breakthrough for NTMs to come into their own.
The brain is not a single universal neural network that does everything well. It's a collection of different neural networks that specialize in different tasks, and probably use very different methods to achieve them.
It seems like the way forward would be networking together various kinds of neural networks to achieve complex goals. For example, an NTM specialized in formulating plans that has access to a CNN for image recognition, and so on.
Actually, in the architecture you described, if there is a planning net that's connected to an image net and an audio net, then rather than feeding audio to the image net, I think synesthesia would be better modeled by feeding the output of the audio net into the image net's input on the planning net. If that makes sense.
It's how some guys defeated the first iteration of reCAPTCHA's audio mode. Then Google replaced it with something very annoying to use, even for humans.
They sure put a lot of focus on "toy" problems such as sorting and path planning in their papers - perhaps because they are easy to understand and show a major improvement over other ML approaches. IMHO they should focus more on "real" problems - e.g. in Table 1 of this paper it seems to be state of the art on the bAbI tasks, which is amazing.
At least some of the "toy" problems aren't chosen just for being easy to solve or understand. They're chosen for being qualitatively different than the kinds of problems other neural nets are capable of solving. Sorting, for example, is not something you can accomplish in practice with an LSTM.
Mainstream work on neural nets is focused on pattern recognition and generation of various forms. I don't mean to trivialize at all when I say this - this gives us a new way to solve problems with computers. It allows us to go beyond the paradigm of hand-built algorithms over bytes in memory.
What DeepMind is exploring with this line of research is whether neural nets can even subsume this older paradigm. Can they learn to induce the kinds of algorithms we're used to writing in our text editors? Given this goal, I think it's better to call problems like sorting "elementary" rather than "toy".
bAbI isn't really a "real" problem either, although somewhat better than sorting and the like. bAbI works with extremely restrictive worlds and grammar. In contrast, current speech recognition, language modeling, and object detection do quite well with actual audio, text, and pictures.
I think the strength of NTMs will be best demonstrated by putting it to work on a long-range language modeling task where you need to organize what you read so that you can use it to predict better a paragraph or two later. Current language models based on LSTM are not really able to do this.
Once you have a learning machine that can solve simple problems, you can scale it up to solve very complex problems. It's a first step to true AI, IMHO. A lot of small steps are needed to go towards this goal. Integrating memory and neural nets is a big step, IMHO.
> Once you have a learning machine that can solve simple problems, you can scale it up to solve very complex problems.
Nope. It's really easy to solve simple problems; it can sometimes even be done by brute-force.
That's what caused the initial optimism around AI, e.g. the 1950s notion that it would be an interesting summer project for a grad student.
Insights into computational complexity during the 1960s showed that scaling is actually the difficult part. After all, if brute-force were scalable then there'd be no reason to write any other software (even if a more efficient program were required, the brute-forcer could write it for us).
That's why the rapid progress on simple problems, e.g. using Eliza, SHRDLU, General Problem Solver, etc. hasn't been sustained, and why we can't just run those systems on a modern cluster and expect them to tackle realistic problems.
DeepMind is breaking new ground in a number of directions. For example, "Decoupled Neural Interfaces using Synthetic Gradients" is simply amazing - they can make training a net asynchronous and run individual layers on separate machines by approximating the gradients with a local net. It's the kind of thing that sounds crazy on paper, but they proved it works.
Another amazing thing they did was to generate audio by direct synthesis from a neural net, beating all previous benchmarks. If they can make it work in real time, it would be a huge upgrade in our TTS technology.
We're still waiting for the new and improved AlphaGo. I hope they don't bury that project.
I'm not super knowledgeable about the space, but would the audio generation you mentioned be what is needed to let their Assistant communicate verbally in any language, any voice, add inflections, emotion, etc. without needing to pre-record all the chunks/combinations?
They're taking features that are present in the brain that aren't modeled and are making computational models for them. They're not a gold standard. You can create your own in under an hour. It's not another level. It's bio-inspired computing.
Here.. take the 'Axon Hillock'
https://en.wikipedia.org/wiki/Axon_hillock
code up a function for it, attach it to present day neuron models, make it do something fancy, write a white-paper and kazaam you're operating on another level..
In the human brain, neurons store an incredible amount of information. Neuron models in neural networks only store information in their weights.
There is still a lack of understanding of how the human brain does it. DeepMind grabbed a proven memory model from Alan Turing's work and applied it to the feature-barren neuron models in use.
Sprinkle magic ...
They are not operating on another level; they're bringing over features that are well documented in the human brain and in white papers from a past period when people actually thought deeply about this problem, and applying them.
There is no 'intuition' about the architecture. Study the human brain and copy pasta into the computing realm.
Others are doing this as well. If anyone bothered to read the white papers people publish, you'll see that many people have presented similar ideas over the years.
You can come up with your own neural Turing machine. Take a featureless neuron model, slap a memory module on it, and you have a neural Turing machine.
In order to use a Turing machine in a neural network - or at least to train it, in any way that isn't impractical and/or cheating - you need to make it differentiable somehow.
Graves and co. have been really creative in overcoming problems in their ongoing program to differentiate ALL the things.
I think the easiest way to see this is with an example of a non-differentiable architecture.
Let's suppose on the current training input, the network produces some output that is a little wrong. It produced this output by reading a value v at location x of memory.
In other words, output = v = mem[x]
It could be wrong because the value in memory should have been something else. In this case, you can propagate the gradient backwards. Whatever the error was at the output, is also the error at this memory location.
Or it could be wrong because it read from the wrong memory location. Now you're a bit dead in the water. You have some memory address x, and you want to take the derivative of v with respect to x. But x is this sort of thing that jumps discretely (just as an integer memory address does). You can't wiggle x to see what effect it has on v, which means that you don't know which direction x should move in in order to reduce the error.
So (at least in the 2014 paper, ignoring the content-addressed memory), memory accesses don't look like v = mem[x]. They look like v = sum_i(a_i * mem[i]). Any time you read from memory, you're actually reading all the memory, and taking a weighted sum of the memory values. And now you can take derivatives with respect to that weighting.
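Here's a tiny numpy sketch of the difference (the memory contents and addressing scores are made up for illustration):

    import numpy as np

    mem = np.array([[1.0, 0.0],
                    [0.0, 5.0],
                    [2.0, 2.0]])   # 3 memory slots, 2 values per slot

    # Hard read: v = mem[x]. Not differentiable with respect to the integer x.
    x = 1
    v_hard = mem[x]

    # Soft read: the controller emits a weighting a over *all* slots
    # (e.g. a softmax over addressing scores), and v = sum_i a_i * mem[i].
    scores = np.array([0.1, 3.0, 0.5])           # hypothetical addressing scores
    a = np.exp(scores) / np.exp(scores).sum()    # softmax: weights sum to 1
    v_soft = a @ mem                             # weighted sum over all of memory

    # v_soft depends smoothly on the scores, so the output error can be
    # backpropagated into the addressing mechanism.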
To me, the question this raises is, what right do we have to call this a Turing machine. This is a very strong departure from Turing machines and digital computers.
Turing didn't specify how reads and writes happened on the tape. For the argument he was making it was clearer to assume there was no noise in the system.
As for "digital" computers remember they are built out of noisy physical systems. Any bit in the CPU is actually a range of voltages that we squash into the abstract concept of binary.
I don't think that is really relevant to the discussion. Regardless of how a digital computer is physically implemented, we use it according to specification. We concretize the concept of binary by designing the machine to withstand noise. What we get when we choose the digital abstraction is that it is actually realistic: digital computers pretty much operate digitally. Corruption happens, but we consider that an error, and we try to design so that a programmer writing all but the most critical of applications can assume that memory does not get corrupted.
We don't squash the range of voltages. The digital component that interprets that voltage does the squashing. And we design it that way purposefully. https://en.wikipedia.org/wiki/Static_discipline
Turing specified that the reads and the writes are done by heads, which touch a single tape position. You can have multiple (finitely many) tapes and heads, without leaving the class of "Turing machine". But nothing like blending symbols from adjacent locations on the tape, or requiring non-local access to the tape.
No wonder Google built (is building) custom accelerators in hardware. This points to a completely different architecture from Von Neumann, or at least it points to MLPUs, Machine Learning Processing Units.
Pardon my ignorance as I'm not super knowledgeable on this, but is what you described around reading all the memory and taking the weighted sum of values similar in a sense to creating a checksum to compare something against?
I suppose I can see the similarity, in that there's some accumulated value (the sum) from reading some segment of memory, but otherwise I don't think the comparison is helpful.
It means it can be trained by backpropagating the error gradient through the network.
To train a neural network, you want to know how much each component contributed to an error. We do that by propagating the error through each component in reverse, using the partial derivatives of the corresponding function.
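A toy worked example of that bookkeeping (one hidden unit, made-up numbers): the error is pushed backwards through each function by multiplying its partial derivative.

    # "Network": y = w2 * relu(w1 * x), squared-error loss L = (y - t)^2
    x, t = 2.0, 1.0
    w1, w2 = 0.5, -1.0

    # Forward pass
    h = max(w1 * x, 0.0)
    y = w2 * h
    L = (y - t) ** 2

    # Backward pass: propagate dL through each component in reverse
    dL_dy = 2.0 * (y - t)
    dL_dw2 = dL_dy * h                                 # contribution of w2
    dL_dh = dL_dy * w2
    dL_dw1 = dL_dh * (1.0 if w1 * x > 0 else 0.0) * x  # contribution of w1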
Don't forget you first need to understand the mathematical theory of how a brain does computation and pattern recognition. Of course they look into how a brain does it. But the mathematical underpinnings, and how the information flows, are much more important than how an individual neuron works in real life. Abstraction, and applying it to real data, is what they are doing.
Computing power means much larger variable spaces can be handled in optimisation problems. NNs are a means to prune the variable space during optimisation in a domain-unspecific way.
Highly recommend this video on the Deep Visualization Toolbox to anyone interested in understanding more about how convnets work through visualization: