This is beyond exciting. Welcome to the new reality!
On one hand, the resources required to run these models continue falling dramatically, thanks to the techniques discovered by researchers: GPTQ quantizing down to 4, 3, 2, even 1 bit! model pruning! hybrid VRAM offloading! better, more efficient architectures! 1-click finetuning on consumer hardware! Of course, the free lunches won't last forever, and this will level off, but it's still incredible.
And on the other side of the coin, the power of all computing devices continues its ever-upward exponential growth.
So you have a continuous lowering of requirements, combined with a continuous increase in available power... surely these two trends will collide, and I can only imagine what this stuff will be like at that intersection.
I would love to see an article on why quantising to low bits works. It seems counterintuitive to me. For example, do that with a CD and it will sound awful. It took smarts to come up with the mp3 format rather than just reducing the number of bits.
A very broad answer is that large NNs are surprisingly resilient to inaccuracies, and that resilience seems to become more pronounced as model size grows. This is readily observable with LLaMA, where 4-bit quantization hurts 7B worst of all.
Furthermore, model size is still the most significant contributor to output quality. E.g. vanilla llama-30b at 4-bit has better perplexity than any llama-13b finetune at 8-bit. Thus, if 4-bit lets you fit a larger model into available (V)RAM, you're still better off.
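As a rough back-of-the-envelope illustration of that memory trade-off (weights only; real usage also needs room for activations, KV cache, and quantization metadata, and the arithmetic here is my own addition, not from the comment above):

```python
# Approximate weight storage for the llama-30b vs llama-13b comparison above.
def weight_gb(params_billion: float, bits: int) -> float:
    return params_billion * 1e9 * bits / 8 / 1e9  # bytes -> GB

print(f"llama-30b @ 4-bit:  ~{weight_gb(30, 4):.0f} GB")   # ~15 GB
print(f"llama-13b @ 8-bit:  ~{weight_gb(13, 8):.0f} GB")   # ~13 GB
print(f"llama-30b @ 16-bit: ~{weight_gb(30, 16):.0f} GB")  # ~60 GB
```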
This is also why analog computing is seriously considered as a hardware architecture for LLMs: if you don't actually need bit-perfect matmul for things to work well, it can be done much more simply as an analog circuit, and then you can cram a lot more of them onto the same chip. Any resulting quality loss would presumably be minor, and in any case would be more than compensated for by the much larger model sizes such an architecture allows.
Note: I'm not into ML, though I dabbled with NNs as a teen (before deep learning and all that).
The weights scale the output values from the previous layer, and the weighted values are summed. So it seems to me that, instead of having a high-precision weight scale a single output, you could clone the node in the previous layer M times and still get sqrt(M) bits of precision with 1-bit weights (or M bits; my brain is in weekend mode).
Thus a larger network with lower-precision weights should be able to achieve approximately the same effective precision as a smaller network with high-precision weights.
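A toy sketch of that cloning argument (my own illustration; the numbers and variable names are made up): with M clones each carrying a 1-bit weight and a shared 1/M scale, the effective weight lands on a grid of spacing 1/M, i.e. roughly log2(M) bits of precision.

```python
import numpy as np

rng = np.random.default_rng(0)

w_true = 0.3137                      # one high-precision weight we want to emulate
x = rng.normal(size=10_000)          # activations coming from the previous layer

for M in (8, 64, 512):
    # Clone the source node M times, give each clone a 1-bit weight in {0, 1}
    # (scaled by 1/M), and choose how many clones are "on".
    n_on = round(w_true * M)
    w_approx = n_on / M              # representable on a grid of spacing 1/M
    err = np.max(np.abs(w_true * x - w_approx * x))
    print(f"M={M:4d}  effective weight={w_approx:.4f}  max abs error={err:.4f}")
```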
The larger network has more interconnects though, so it seems like it could allow for a more interesting space to explore during training, leading to better results.
A CD doesn’t work as an analogy. Think about it this way — if you build a model and don’t train it at all, it will still have the same number of parameters and take up the same amount of disk space.
We’re finding out that many models are undertrained for their sizes, and a good option is to post-process them into smaller models by teaching a smaller model to mimic their output (distillation). Quantization effectively cuts down the model size as well. No loss in quality means that the model has not been trained enough to take advantage of the depth of precision that is available.
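A minimal sketch of the "teach a smaller model to mimic the output" idea, i.e. knowledge distillation (the function name, temperature, and loss weighting are illustrative assumptions, not something from the comment above):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions and push the student toward the teacher.
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # KL divergence, scaled by t^2 so gradient magnitudes stay comparable.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t * t

# In a training loop this is typically mixed with the ordinary LM loss, e.g.:
# loss = ce_loss(student_logits, labels) + 0.5 * distillation_loss(student_logits, teacher_logits)
```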
The analogy I'm currently favouring when talking to semi-technical people is that LLMs are a map. We map words and phrases to a coordinate space.
We can use GPS to locate anything down to a sliding scale of decimal precision. There are only so many digits you need to locate a city or even a house.
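To make the coordinate analogy concrete (the coordinates and the 111 km-per-degree figure are rough approximations I'm adding, not part of the comment above):

```python
lat, lon = 51.507351, -0.127758          # roughly central London

for decimals in (1, 2, 3, 4):
    approx = (round(lat, decimals), round(lon, decimals))
    # One degree of latitude is ~111 km, so the truncation error is easy to bound.
    worst_case_km = 0.5 * 10 ** (-decimals) * 111
    print(f"{decimals} decimals -> {approx}, worst-case error ~{worst_case_km:.2f} km")
```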
I think a lot of it is that they are intentionally not measuring the "degradation" in quality experienced. I've noticed that 8-bit quantization of a model like Dolly is significantly worse than the 32-bit version of it. I've seen similar results using quantization with Stable Diffusion: the images really are worse, but the loss at half precision is small enough that it's worth the trade-off.
What size model are you quantizing and comparing? The interesting thing about quantization is how the larger the number of parameters, the less of a difference it makes to quantize the weights, even to an extreme degree when working with the largest models. For small models it can be a disaster though.
So do they take the weights, which are say 32-bit floats, and just round them to the nearest something, putting them in a range of 0-255? I guess I can see how it could work if the weights are all close to zero, so -1 to 1 is mapped to 0-255.
But I would have thought the model relied on the higher accuracy during training. So losing that would screw it up.
That commenter is just wrong. We have empirical tests of quality loss due to quantization, and even down to 4 bits the loss is so negligible no human would ever be able to detect it. The loss only even registers on the benchmarks after generating tens of thousands of full-context generations.
> So do they take the weights, which are say 32-bit floats, and just round them to the nearest
That's how they used to do it, and it's still how 8-bit quantization works; it's called round-to-nearest (RTN) quantization. That's not how the newer low-bit methods work though.
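Here's a minimal sketch of what RTN quantization can look like, assuming a symmetric absmax scheme over a single weight row (real implementations vary in the details, e.g. per-channel scales):

```python
import numpy as np

def rtn_quantize(weights: np.ndarray, bits: int = 8):
    qmax = 2 ** (bits - 1) - 1              # e.g. 127 for int8
    scale = np.abs(weights).max() / qmax    # map the largest weight to qmax
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float32) * 0.02   # toy weight row
q, scale = rtn_quantize(w)
print("max abs error:", np.abs(w - dequantize(q, scale)).max())
```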
The current algorithms (GPTQ, RPTQ, etc.) are more complex, including things like lining up the weights in order from least to greatest, placing them in bins (typically 32 or 128 weights per bin), and then computing an offset for each bin which is added to the RTN value. In some cases bins are identical and redundant, and can be reused without storing the same bin twice. These are just a few of the space-saving measures that go into effective low-bit quantization without sacrificing quality.
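A hedged sketch of the binned/grouped idea described above: split the weights into fixed-size groups and store a per-group scale and offset alongside the low-bit codes. This illustrates plain group-wise quantization only, not the actual GPTQ algorithm, which additionally uses second-order information to compensate for rounding error:

```python
import numpy as np

def groupwise_quantize(weights: np.ndarray, bits: int = 4, group_size: int = 128):
    levels = 2 ** bits - 1                       # 15 codes for 4-bit
    w = weights.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / levels             # one scale per group ("bin")
    zero = w_min                                 # one offset per group
    q = np.clip(np.round((w - zero) / scale), 0, levels).astype(np.uint8)
    return q, scale, zero

def groupwise_dequantize(q, scale, zero):
    return q.astype(np.float32) * scale + zero

w = np.random.randn(4096 * 128).astype(np.float32) * 0.02
q, scale, zero = groupwise_quantize(w, bits=4)
w_hat = groupwise_dequantize(q, scale, zero).reshape(-1)
print("mean abs error:", np.abs(w - w_hat).mean())
```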
It's very similar to state of the art video codecs or image compression algorithms. A raw photograph taken by my digital camera is 60MB, but a PNG of the same photo is 30x smaller at 2MB without a single artifact. It should be no surprise that we can reduce models by 4x, 8x, or even more without sacrificing quality.
I am not wrong, you are wrong. The fact is that NLP and other fields are FULL of people using automated benchmarks to claim that they are "state of the art". They are incentivized to downplay or trivialize any quality losses. Scores like ROUGE and BLEU are terrible and the whole community knows it, but they're still used because we have nothing "better".
I can actually see JPG artifacts on the JPG variants of the PNG files that I generate in Stable Diffusion, and the impacts from quantization down to 3, 2, even 1 bit are FAR more than the impacts of switching from PNG to JPG.
Also, I actually have published peer reviewed research on LLMs and spend a majority of my time on this earth thinking about and coding for them. I know what I'm talking about and you shouldn't try to dismiss my criticisms so quickly.
Even the coomers at Civitai have done polls where their own users find DreamBooth models better than LoRA models on average, likely because the likeness of a person can be trained more properly when heavier/stronger methods are used. The same dynamic applies here with quantization.
Yes, as a model scales up in size, quantization hurts it less. But the claims that extreme quantization is not noticeable at all when the model is super large are just pathetically wrong.
> But I would have thought the model relied on the higher accuracy during training. So losing that would screw it up.
Yes, during training, where you need to make tiny adjustments to weights. But as far as I understand it, inference can still work well because of the sheer number of weights. Give a black-and-white image a high enough resolution and you can represent any shade of gray if you zoom out a bit.
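A toy illustration of that black-and-white image analogy (my own example; the target shade and block sizes are arbitrary): with enough 1-bit pixels per block, the block average approximates any shade of gray.

```python
import numpy as np

rng = np.random.default_rng(0)
target_gray = 0.37                       # the shade we want to represent

for block in (4, 16, 64):                # block x block 1-bit pixels per "zoomed out" pixel
    pixels = (rng.random((block, block)) < target_gray).astype(np.float32)
    print(f"{block}x{block} block -> average {pixels.mean():.3f} (target {target_gray})")
```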
At that intersection is the "Good Enough Model" that can solve 95% of our needs in full privacy and with complete customisability. The key point is being easy to run on every device. We'll still use proprietary, expensive models for the remaining 5%.
This of course is making AI more open. The final piece is for people to start packaging these in Windows and Mac installers so that the average computer user can make use of them, and for them to run on the crappy graphics card you are likely to get with a default PC configuration!
Or for these to be at least available in every cloud and well understood, so you are just paying AWS for compute but not for secret sauce (like Kubernetes, for example, which lets you walk away, so there is genuine competition).
Your racist comment is too idiotic to be authentic. Are you a spy working for the CCP?
The US big tech companies are already filled with Indian and Chinese workers. Are you trying to influence tech companies to hire more CCP agents so the technology can be siphoned off to your CCP overlords?
Mike_12345, that kind of language and accusation is not productive or helpful to the conversation. In fact, it can be quite harmful and perpetuate damaging stereotypes and xenophobia.
Let's focus on the important work being done by the RedPajama project and the potential benefits it can bring to a wide range of applications and users. Instead of promoting division and suspicion, let's work towards creating a more diverse, inclusive, and collaborative environment in AI research and development.