He's leaving to work on decentralized AI? That's exactly what Stability AI was doing before it became clear the economics no longer work out in practice, and starting a new company wouldn't change that. (Emad is an advisory board member to a decentralized GPU company, though: https://home.otoy.com/stabilityai/ )
Obviously this is the polite way to send him off given the latest news about his leadership, but this rationale doesn't track.
Instead of wasting all that compute on Bitcoin, we pretrain fully open models which can run on people's hardware. A 120B ternary model is the most interesting thing in the world. No one can train one now because you need a billion-dollar supercomputer.
SETI made sense because there is a lot of data where you download a chunk, do an expensive computation, and return a thin result.
Model training is unlike that. It's a large state that is constantly updated, and each update requires the full, up-to-date state.
This means you cannot distribute it efficiently over a slow network with many small workers.
That's why NVIDIA provides scalable clusters with specialized interconnects that have ultra-low latency and massive throughput.
Even in those setups it takes on the order of a month to train a base model.
Converted to a distributed setup, this same task would take billions of years - i.e. it's not feasible.
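To put rough numbers on it (the 120B figure comes from the comment above; the bandwidth numbers are illustrative assumptions, not measurements), here's a back-of-envelope sketch:

    # Rough back-of-envelope: time to move one full fp16 gradient for a
    # 120B-param model over a home connection vs. a datacenter interconnect.
    # Bandwidth figures are illustrative assumptions, not measurements.

    PARAMS = 120e9                   # 120B parameters (from the comment above)
    GRAD_BYTES = PARAMS * 2          # fp16 -> ~240 GB per full gradient exchange

    HOME_UPLINK_BPS = 20e6 / 8       # assumed 20 Mbit/s residential uplink -> 2.5 MB/s
    DATACENTER_BPS = 400e9 / 8       # assumed 400 Gbit/s InfiniBand/NVLink-class link -> 50 GB/s

    def transfer_time_s(nbytes, bytes_per_s):
        return nbytes / bytes_per_s

    home_s = transfer_time_s(GRAD_BYTES, HOME_UPLINK_BPS)
    dc_s = transfer_time_s(GRAD_BYTES, DATACENTER_BPS)

    print(f"home uplink: {home_s / 3600:.1f} hours per gradient exchange")   # ~26.7 hours
    print(f"datacenter:  {dc_s:.1f} seconds per gradient exchange")          # ~4.8 seconds
    print(f"ratio:       ~{home_s / dc_s:,.0f}x slower")                     # ~20,000x

    # Multiply that per-step gap by the ~10^5-10^6 optimizer steps a base model
    # needs and the over-the-internet version stops being feasible.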
There aren't any known ways of contributing computation without access to the full state. This would require a completely different architecture, not just "different from transformers" but "different from gradient descent", which would basically mean creating a new branch of machine learning and starting from zero.
The safe bet is on "ain't going to happen" - better to focus on the current state of the art and keep advancing it until it can build itself, and anything else we can dream of, to reach "mission fucking accomplished".
You're right that parameter updates typically require the full state, but during my PhD I explored some possibilities to address this limitation (unfortunately, my ideas didn't pan out in the time I had). That said, there is research that has explored this topic and made some progress, such as this paper:
Unfortunately it's hardly progress. The MoE expert models are still large and have to be trained in the usual, linear way; the approach requires classifying the training set upfront; each expert model is completely independent and has to relearn concepts on its own; the overall model is only as good as the relevant dedicated expert; and the scale is in the low numbers, i.e. 8 experts, not thousands (otherwise you could only run inference on a beefed-up cluster, since experts still have to be loaded when used); etc.
But if mixture-of-experts models can outperform "monolithic" models, why not? Maybe instead of 8 you can do 1000, and that is easy to parallelize. It sounds worth exploring to me.
I think MoE models are trained together just like any other network though, including the dispatcher layer that has to learn which "expert" to route each token to. Perhaps you could use a technically worse architecture whose experts are trained separately, and then a more complex dispatcher that learns to utilize the individually trained experts as best it can?
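For reference, the router in a standard MoE layer is tiny compared to the experts, but it's trained jointly with them. A minimal top-k routing sketch in PyTorch (k=2 and the shapes are illustrative, loosely Mixtral-style, not any particular paper's setup):

    import torch
    import torch.nn.functional as F

    def moe_layer(x, router_w, experts, k=2):
        """Minimal top-k MoE routing sketch.
        x: (tokens, d_model), router_w: (d_model, n_experts),
        experts: list of callables mapping (n, d_model) -> (n, d_model)."""
        logits = x @ router_w                                    # (tokens, n_experts)
        weights, idx = torch.topk(F.softmax(logits, dim=-1), k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)    # renormalize over the chosen k
        out = torch.zeros_like(x)
        for e, expert in enumerate(experts):
            for slot in range(k):
                mask = idx[:, slot] == e                         # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot+1] * expert(x[mask])
        return out

    # Toy usage with made-up sizes:
    d, n_exp = 16, 4
    experts = [torch.nn.Linear(d, d) for _ in range(n_exp)]
    router_w = torch.randn(d, n_exp)
    y = moe_layer(torch.randn(8, d), router_w, experts)

    # The point: router_w is learned jointly with the experts during normal MoE
    # training, so every expert's weights still have to be reachable by the
    # backward pass.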
During MoE training you still need access to all weights.
1k experts at 7B params each would mean roughly 30 TB of state to juggle. Training and inference are infeasible at that size.
If you wanted to keep the total size while increasing the number of experts, each expert would shrink from 7B to 56M params. What kind of computation can you do with a 56M model? Remember that an expert in an MoE runs the whole inference without consulting or otherwise reusing any information from the other experts; a thin network at the top just routes each token to one of them. At that size they are not "experts" anymore - it'd be more like a Mixture of Idiots.
To put it another way, MoE is an optimization technique with a low scaling ceiling - more of a local-maximum solution than a global one (the idea works against you quickly if you try to push further in that direction).
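The arithmetic behind those two numbers, roughly (fp32 storage assumed for the first claim; optimizer state would multiply it further):

    # Rough arithmetic behind the two claims above.

    P_EXPERT = 7e9                 # 7B params per expert
    BYTES_FP32 = 4                 # fp32 storage; Adam moments etc. would add more

    # Claim 1: scale to 1000 experts while keeping 7B-sized experts.
    n_experts = 1000
    total_bytes = n_experts * P_EXPERT * BYTES_FP32
    print(f"{total_bytes / 1e12:.0f} TB of weights alone")     # ~28 TB, i.e. the ~30 TB ballpark

    # Claim 2: keep the total budget of an 8x7B model but split it 1000 ways.
    total_params = 8 * P_EXPERT                                # ~56B
    per_expert = total_params / 1000
    print(f"{per_expert / 1e6:.0f}M params per expert")        # ~56M -- smaller than GPT-2-small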
I don't think MoE allows for that either. You'd have to come up with a whole new architecture that allows parts to be trained independently and still somehow be merged together in the end.
Cool paper. It's more independent than a dense model or a normal MoE, but I think it's still far from the distributed training you're looking for: you still need a seed LM which is trained normally, and when fine-tuning each expert from that seed LM you still need enough GPUs or VRAM to fine-tune the whole thing, so you're still limited to large GPU clusters - which is the problem we're trying to avoid.
In the case of the paper, they use OPT-6.7b as the seed LM, which requires 8x V100 GPUs to fine-tune each expert. That's a combined 256GB of VRAM for a single expert, while a 3090 only has 24GB and is still one of the most expensive consumer GPUs out there.
Maybe we could use something like PEFT or QLoRA in combination with this technique to make each expert small enough for the community to fine-tune and make a worse Mixtral 8x7b, but I don't know enough to say for sure.
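For what it's worth, the reason adapter-style fine-tuning might help here is that each contributor would only train and ship a low-rank delta instead of a full expert. A minimal LoRA-style sketch in plain PyTorch (rank, alpha, and layer sizes are made-up illustrations, not the paper's method or the peft library's API):

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Frozen base weight plus a trainable low-rank delta: W x + (alpha/r) * B A x."""
        def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad_(False)                          # only the adapter is trained
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))
            self.scale = alpha / r

        def forward(self, x):
            return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

    base = nn.Linear(4096, 4096)                                 # hypothetical layer size
    adapted = LoRALinear(base, r=8)
    trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
    total = sum(p.numel() for p in adapted.parameters())
    print(f"trainable {trainable:,} of {total:,} params")        # ~65K vs ~16.8M for this layer

The delta you'd pass around is just A and B, which is orders of magnitude smaller than the expert itself - that's the whole appeal for community fine-tuning.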
Or maybe it turns out we can make a good MoE model with thousands of smaller experts. Experts small enough for a separate member of the community to independently fine-tune on a normal GPU, but idk.
To train a performant LLM from scratch in a distributed way, we'd still need a completely different architecture, but this work is pretty cool and may mean that, if nothing else, there is something the community can do to help move things forward.
Also, I was going to say the MoE routing on this technique was lacking, but I found a more recent paper[0] by Meta which fixes this with a final fine-tuning stage.
The base model was still trained in the usual, non-distributed way (by far the biggest cost).
The fine-tunes were also trained in the usual, non-distributed way.
The proposed approach tries out several combinations and picks the one that seems to perform best (where a "combination" means e.g. an ad hoc per-layer operation).
The merging itself is not distributed either.
There is not much distribution happening overall beyond the fact that the fine-tunes were trained independently.
Taking weight averages, weighted weight averages, trimming low-magnitude diffs, doing arithmetic (subtracting the base model from the fine-tune), etc. are all ad hoc trials - throwing things at the wall and seeing what sticks. None of them work particularly well.
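Concretely, those merge recipes are all simple per-tensor arithmetic over state dicts. A minimal sketch of two of them (uniform averaging and task-vector addition), assuming the checkpoints share an architecture:

    import torch

    def average_merge(state_dicts):
        """Uniform weight averaging ("model soup" style)."""
        merged = {}
        for k in state_dicts[0]:
            merged[k] = torch.stack([sd[k].float() for sd in state_dicts]).mean(dim=0)
        return merged

    def task_arithmetic_merge(base_sd, finetuned_sds, scale=1.0):
        """Add up (fine-tune - base) "task vectors" on top of the base weights."""
        merged = {}
        for k in base_sd:
            delta = sum(sd[k].float() - base_sd[k].float() for sd in finetuned_sds)
            merged[k] = base_sd[k].float() + scale * delta
        return merged

    # Both are ad hoc in exactly the sense above: there's no principled algebra
    # telling you which recipe (or which scale) preserves what each fine-tune learned.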
For distributed training to work we'd need a better algebra over this multidimensional/multilayer/multiconnectivity state. We don't have it, and the problem has many facets, e.g. evaluation is way too expensive. But solving the "no need to rerun the whole training/benchmark corpus to see if my tiny change is better" problem would mean we've solved the problem of extracting the essence of intelligence. And if we do that, hyper-efficient data centers will still keep beating any distributed approach, and it's all largely irrelevant anyway because that's pure AGI already.
That's wrong. What you described is data parallelism, and it would indeed be very tricky to e.g. sync gradients across machines. But this is not the only method of training neural nets (transformers or any other kind) in parallel. If we'd like to train, say, a human-brain-complexity model with 10^15 parameters, we'd need a model-parallelism approach anyway. It introduces some complexity, since you need to make sure that each distributed part of the model can run individually with roughly the same amount of compute, but you no longer need to sync everything (or hold the entire state on one machine). The real question is whether you can find enough people to run this who will never be able to run the result themselves, because inference alone will still require a supercluster. If you have access to that, you might as well train something on it today.
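To make the distinction concrete, here's a toy sketch of the model-parallel (pipeline) idea: layers of one model split across workers, with only boundary activations and their gradients crossing between them. Both "machines" live in one process here purely for illustration; sizes are made up:

    import torch
    import torch.nn as nn

    class Stage(nn.Module):
        """One worker's contiguous slice of the model's layers."""
        def __init__(self, layers):
            super().__init__()
            self.layers = nn.Sequential(*layers)
        def forward(self, x):
            return self.layers(x)

    layers = [nn.Sequential(nn.Linear(512, 512), nn.ReLU()) for _ in range(8)]
    stage_a = Stage(layers[:4])    # imagine this living on machine A
    stage_b = Stage(layers[4:])    # ...and this on machine B

    x = torch.randn(32, 512)
    h = stage_a(x)                 # only the (32, 512) boundary activation would cross the network
    y = stage_b(h)
    y.sum().backward()             # backward likewise only ships the boundary gradient

    # What still crosses between stages every step: activations forward and
    # their gradients backward, once per microbatch.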
The lack of data parallelism isn't a choice; it's implied by the computation that is performed.
You gradient descend on your state.
Each step needs to work on the up-to-date state; otherwise you're computing the gradient from a state that doesn't exist anymore, and the resulting delta is nonsensical when applied to the most recent state (it was calculated on the old one, so the direction your computation produced is now wrong).
You also can't calculate it without access to the whole state: you have to do a full forward and backward pass and mutate the weights.
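A toy illustration of the staleness point on a 1-D quadratic (numbers are made up, the effect is general):

    # A gradient computed on old weights, applied after the weights have already
    # moved, can push you the wrong way. f(w) = w^2, so grad(w) = 2w, minimum at w = 0.

    def grad(w):
        return 2.0 * w

    lr = 0.4
    w = 1.0

    stale_grad = grad(w)            # a slow worker computes the gradient at w = 1.0

    # Meanwhile, faster workers have already pushed the weights past the minimum:
    w = -0.3

    w_after_stale_update = w - lr * stale_grad
    print(w_after_stale_update)     # -1.1 -- farther from the optimum than before

    # A tightly synced cluster computes every update against the current state,
    # which is exactly the property a slow, wide network can't give you cheaply.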
There aren't any ways of slicing and distributing this that make sense in terms of efficiency.
The reason is that too much data, at too high a frequency, needs to be mutated and then made readable again.
That's also why NVIDIA is focusing so much on hyper-efficient interconnects - because that's the bottleneck.
Compute itself is way ahead of data transfer in and out. Data transfer is the main problem, and moving to a setup that makes it several orders of magnitude worse is just not the way to go.
If somebody solves this problem, it'll mean they solved a much more interesting one: it would mean you can locally uptrain a model and inject that knowledge into a bigger one arbitrarily.
Your gradient descent is an operation on a directed acyclic graph. The graph itself is stateless. You can compute parts of the graph without access to the entire graph, particularly for transformers. In fact this is already done today for training and inference of large models. The transfer bottleneck applies to currently used model sizes and architectures. There's nothing to stop you from building a model so complex that compute itself becomes the bottleneck rather than data transfer - except its ultimate usability, of course, as I already mentioned.
Your DAG is big. It's only stateless for a single pass; the next pass doesn't operate on it anymore, it operates on the new, updated one produced by the previous step. It also has fully connected sub-DAGs.
There is nothing stopping you from distributing the execution of assembly/machine code instruction by instruction across machines either, yet nobody does it because it doesn't make sense from a performance perspective.
Or Amazon driving a truck from one depot to another, unloading one package at a time, to "distribute" the unloading because "distributed = faster".
Yes, if there was something interesting there, you'd think something would have happened since 2017. Reinforcement learning (which it is compared against) is not particularly famous for its performance (that's its biggest issue and the reason it isn't used that much). Also, transformers don't use it at all.
OpenAI has turned for-profit and stopped releasing any technical details about architectures or training. So how do you know that nothing has happened? Because they didn't release it? Do you see the issue here?
Wonder if that can be avoided by modifying the training approach. Ideas offhand: group the data by topic and train a subset of weights per node; figure out which layers have the most divergence and reduce the lr on those only.
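A rough sketch of the second idea: per-layer learning rates scaled down by divergence from a shared reference checkpoint. The divergence metric and the scaling rule here are made-up assumptions, not an established recipe:

    import torch

    def per_layer_lrs(local_sd, reference_sd, base_lr=1e-4, min_scale=0.1):
        """Hypothetical heuristic: shrink the learning rate for the layers whose
        weights have drifted most from a shared reference checkpoint."""
        div = {}
        for k in reference_sd:
            ref = reference_sd[k].float()
            div[k] = (local_sd[k].float() - ref).norm().item() / (ref.norm().item() + 1e-8)
        max_div = max(div.values()) + 1e-8
        # Most-diverged layer gets base_lr * min_scale; least-diverged keeps ~base_lr.
        return {k: base_lr * (1.0 - (1.0 - min_scale) * d / max_div) for k, d in div.items()}

    # The resulting per-layer lrs could feed torch.optim parameter groups, one per layer.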
Read his Wikipedia page and tell me he doesn’t sound like your run of the mill crypto scammer.
> He claims that he holds B.A. and M.A. degrees in mathematics and computer science from the University of Oxford.[7][8] However, according to him, he did not attend his graduation ceremony to receive his degrees, and therefore, he does not technically possess a BA or an MA.[7]
In the US, attending your graduation ceremony has zero bearing on whether the university recognizes that you earned a degree. Is the UK, or Oxford, different in this regard? Who cares if someone attended a ceremony? This sounds fraudulent at first glance. People with legitimate credentials don't need to hedge their claims with technicalities.
Kinda like Deltec's "Deputy CEO"? (Tether's bank), or even Deltec itself:
At the start of 2021, according to their website, it was a 55 year old bank. By the end of 2021, it was a 70 year old bank!
The bank's website is a WordPress site. And their customers must be unhappy - online banking hasn't worked for nearly two years at this point.
Anyway, their Deputy CEO gave this hilarious interview from his gaming rig. A 33 year old Deputy CEO, who by his LinkedIn claimed to have graduated HEC Lausanne in Switzerland with a Master of Science at the age of 15... celebrating his graduation by immediately being named Professor of Finance at a university in Lebanon. While dividing his spare time between running hedge funds in Switzerland and uhh... Jacksonville, FL.
The name of his fund? Indepedance [sic] Weath [sic] Management. Yeah, okay.
In this hilariously inept interview, he claimed that people's claims about Deltec's money movements being several times larger than all the banking in their country was due to them misunderstanding the country's two banking licenses, the names of which he "couldn't remember right now" (the Deputy CEO of a bank who can't remember the name of banking licenses), and he "wasn't sure which one they had, but we might have both".
Once the ridicule and all this started piling on, within 24 hours, he was removed from the bank's website leadership page. When people pointed out how suspicious that looked, he was -re-added-.
The bank then deleted the company's entire website and replaced it with a minimally edited WordPress site, where most of the links and buttons were non-functional and remained so for months thereafter.
I mean fuck it, if the cryptobros want to look at all that and say "seems legit to me", alright, let em.
I didn’t go to Oxford, but going to your graduation ceremony isn’t usually a requirement for possessing a BA. The university just mails your diploma to you.
A pretty simple background check would answer that question. If he's claiming those credentials without actually having them, I would assume it would be common knowledge by now.
Someone became a US House Rep while lying about an education they did not have, with a completely falsified resume. I wouldn't be so quick to assume that if he was lying, everyone would know by now.
But now that the crypto boys are back in vogue and returning from their hibernation / AI vacations thanks to price levels, you can combine two hype trends into one and capture the imagination and wallets of two intersecting circles of fools!
So these days, if someone is talking about decentralized anything, I'd bet it involves coinshit again.