Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Yea, this is currently the confusing part of running local models for newbies: Even after you have decided which model you want to run, and which org's quantizations to use (let's just assume Unsloth's for example), there are often dozens of quantizations offered, and choosing among them is confusing.

Say you have a GPU with 20GB of VRAM. You're probably going to be able to run all the 3-bit quantizations with no problem, but which one do you choose? Unsloth offers[1] four of them: UD-IQ3_XXS, Q3_K_S, Q3_K_M, UD-Q3_K_XL. Will they differ significantly? What are each of them good at? The 4-bit quantizations will be a "tight squeeze" on your 20GB GPU. Again, Unsloth steps up to the plate with seven(!!) choices: IQ4_XS, Q4_K_S, IQ4_NL, Q4_0, Q4_1, Q4_K_M, UD-Q4_K_XL. Holy shit where do I even begin? You can try each of them to see what fits on your GPU, but that's a lot of downloading, and then...

Once you [guess and] commit to one of the quantizations and do a gigantic download, you're not done fiddling. You need to decide at the very least how big a context window you need, and this is going to be trial and error. Choose a value, try to load the model, if it fails, you chose too large. Rinse and repeat.

Then finally, you're still not done. Don't forget the parameters: temperature, top_p, top_k, and so on. It's bewildering!

1: https://huggingface.co/unsloth/Qwen3.6-27B-GGUF



We made Unsloth Studio which should help :)

1. Auto best official parameters set for all models

2. Auto determines the largest quant that can fit on your PC / Mac etc

3. Auto determines max context length

4. Auto heals tool calls, provides python & bash + web search :)


Yea, I actually tried it out last time we had one of these threads. It's undeniably easy to use, but it is also very opinionated about things like the directory locations/layouts for various assets. I don't think I managed to get it to work with a simple flat directory full of pre-downloaded models on an NFS mount to my NAS. It also insists on re-downloading a 3GB model every time it is launches, even after I delete the model file. I probably have to just sit down and do some Googleing/searching in order to rein the software in and get it to work the way I want it to on my system.


Oh my apologies I didn't respond - if only HN had a notifier haha

Oh yes we added a custom folder button which can pull .gguf files for now from any folder - it supports LM Studio and Ollama ones - but afreed it's still a mess.

One of the goals is to somehow quick search for .gguf folders, and add recommended folders - we currently have folders for Ollama and LM Studio for eg


Sadly doesn't support fine tuning on AMD yet which gave me a sad since I wanted to cut one of these down to be specific domain experts. Also running the studio is a bit of a nightmare when it calls diskpart during its install (why?)


Apologies as well didn't reply sooner - Studio supports AMD out of the box now! We worked with AMD to make it work! One thing that is still missing is pre-compiled AMD ROCM binaries, which we're trying to see if we can integrate that.

Interesting on diskpart - let me check and get back to you [EDIT] - visual studio build tools, python 3.13, git, cmake, node.js are all msi-based installers - so these are likely the culprits on using diskpart - essentially MSI installers check if there's enough disk space before installing items


Thanks for that. Did you notice that the unsloth/unsloth docker image is 12GB? Does it embed CUDA libraries or some default models that justifies the heavy footprint?


Hey so sorry didn't reply sooner - yes the docker used to be I think 4-8GB ish since CUDA sadly itself is 4GB I think, and PyTorch takes the rest. So unfortunately the Unsloth Docker image has ballooned due to this. We tried reducing it as much as possible, but it's hard :( https://hub.docker.com/r/vllm/vllm-openai/tags for eg is around 11GB ish, ad we're 13.6GB ish.

We'll try our best to compress it more, but it's tough


I applaud that you recently started providing the KL divergence plots that really help understand how different quantizations compare. But how well does this correlate with closed loop performance? How difficult/expensive would it be to run the quantizations on e.g. some agentic coding benchmarks?


Hey! Sorry for not replying sooner - yes we'll keep publishing more KLD - sadly some are saying we are "optimizing" for KLD now since we posted so many haha - but the whole purpose of quantization is to match the BF16 logits as much as possible whilst reducing disk space (ie reduce KLD).

In general so this is funny and a quirk of quantization - sometimes 8bit, 4bit models do BETTER on downstream benchmarks (SWE Bench for eg), since sometimes rounding can actually somehow act as a "regularization" method (this is just my hunch).

So KLD isn't that expensive, since we leverage the trick of causal attention - since causal attention is lower triangular, we can do 1 forward pass on the enter text (say 2048 tokens), and you attain logits for the prediction for every token's position - so this is O(N^2).

However coding benchmarking require actual inference, and cannot use the causal attention trick, and it's best to run them 10 times since temperature = 1.0 is not deterministic - and take an average. We plan to maybe do something like https://marginlab.ai/trackers/claude-code/, which takes a random sample and does it over time.


Is unsloth working on managing remote servers, like how vscode integrates with a remote server via ssh?


Hey sorry on the delay - we just added API support, so you can access a remote server - it includes optional python, tool call, bash and web search support if you enable them.

For SSH - we haven't yet done that - for now we have a SHA256 encryption approach, but it's not SSH yet. HTTPS will also sadly have to be the end user's setup process as well - we plan to make it better soon!


Lmstudio Link is GREAT for that right now


Oh yes LM Link is cool!


what are you using for web search?


We use Duck Duck Go - sorry on the delayed response as well


Great project! Thank you for that!


Thank you and appreciate it! Sorry on the delayed reply as well


> Say you have a GPU with 20GB of VRAM. You're probably going to be able to run all the 3-bit quantizations with no problem, but which one do you choose? Unsloth offers[1] four of them: UD-IQ3_XXS, Q3_K_S, Q3_K_M, UD-Q3_K_XL

There are actually two problems with this:

First, the 3-bit quants are where the quality loss really becomes obvious. You can get it to run, but you’re not getting the quality you expected. The errors compound over longer sessions.

Second, you need room for context. If you have become familiar with the long 200K contexts you get with SOTA models, you will not be happy with the minimal context you can fit into a card with 16-20GB of RAM.

The challenge for newbies is learning to identify the difference between being able to get a model to run, and being able to run it with useful quality and context.


Qwen3.5 series is a little bit of an exception to the general rule here. It is incredibly kv cache size efficient. I think the max context (262k) fits in 3GB at q8 iirc. I prefer to keep the cache at full precision though.


I just tested it and have to make a correction. With llama.cpp, 262144 tokens context (Q8 cache) used 8.7 GB memory with Qwen3.6 27B. Still very impressive.


The MoE variants are more cache efficient. Here from Qwen3.6 35B A3B MoE with 256k (262144) context at full F16 (so no cache quality loss):

  llama_kv_cache: size = 5120.00 MiB (262144 cells,  10 layers,  4/1 seqs), K (f16): 2560.00 MiB, V (f16): 2560.00 MiB
The MXFP4-quantized variant from Unsloth just fits my 5090 with 32GB VRAM at 256k context.

Meanwhile here's for Qwen 3.6 27B:

  llama_kv_cache: size = 3072.00 MiB ( 49152 cells,  16 layers,  4/1 seqs), K (f16): 1536.00 MiB, V (f16): 1536.00 MiB
So 16 tokens per MiB for the 27B model vs about 51 tokens per MiB for the 35B MoE model.

I went for the Q5 UD variant for 27B so could just fit 48k context, though it seems if I went for the Q4 UD variant I could get 64k context.

That said I haven't tried the Qwen3.6 35B MoE to figure out if it can effectively use the full 256k context, that varies from model to model depending on the model training.


I found the KLD benchmark image at the bottom of https://unsloth.ai/docs/models/qwen3.6 to be very helpful when choosing a quant.


Yea, I'm also kind of jealous of Apple folks with their unified RAM. On a traditional homelab setup with gobs of system RAM and a GPU with relatively little VRAM, all that system RAM sits there useless for running LLMs.


That "traditional" setup is the recommended setup for running large MoE models, leaving shared routing layers on the GPU to the extent feasible. You can even go larger-than-system-RAM via mmap, though at a non-trivial cost in throughput.


Strix Halo is another option


qwen3.5 27b w/ 4bit quant works reasonably on a 3090.


HuggingFace has a nice UI where you can save your specs to your account and it will display a checkmark/red X next to every unsloth quantization to estimate if it will fit.


Evaluating different quant levels for your use case is a problem you can pretty reliably throw at a coding agent and leave overnight though. At least, it should give you a much smaller shortlist.


To add more complexity to the picture, you can run MoE models at a higher quant than you might think, because CPU expert offload is less impactful than full layer offload for dense models.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: