Yea, this is currently the confusing part of running local models for newbies: E...

danielhanchen · 2026-04-22T16:12:47 1776874367

We made Unsloth Studio which should help :)

1. Auto best official parameters set for all models

2. Auto determines the largest quant that can fit on your PC / Mac etc

3. Auto determines max context length

4. Auto heals tool calls, provides python & bash + web search :)

ryandrake · 2026-04-22T16:50:02 1776876602

Yea, I actually tried it out last time we had one of these threads. It's undeniably easy to use, but it is also very opinionated about things like the directory locations/layouts for various assets. I don't think I managed to get it to work with a simple flat directory full of pre-downloaded models on an NFS mount to my NAS. It also insists on re-downloading a 3GB model every time it is launches, even after I delete the model file. I probably have to just sit down and do some Googleing/searching in order to rein the software in and get it to work the way I want it to on my system.

danielhanchen · 2026-04-30T05:07:38 1777525658

Oh my apologies I didn't respond - if only HN had a notifier haha

Oh yes we added a custom folder button which can pull .gguf files for now from any folder - it supports LM Studio and Ollama ones - but afreed it's still a mess.

One of the goals is to somehow quick search for .gguf folders, and add recommended folders - we currently have folders for Ollama and LM Studio for eg

hypercube33 · 2026-04-22T17:49:51 1776880191

Sadly doesn't support fine tuning on AMD yet which gave me a sad since I wanted to cut one of these down to be specific domain experts. Also running the studio is a bit of a nightmare when it calls diskpart during its install (why?)

danielhanchen · 2026-04-30T05:09:44 1777525784

Apologies as well didn't reply sooner - Studio supports AMD out of the box now! We worked with AMD to make it work! One thing that is still missing is pre-compiled AMD ROCM binaries, which we're trying to see if we can integrate that.

Interesting on diskpart - let me check and get back to you [EDIT] - visual studio build tools, python 3.13, git, cmake, node.js are all msi-based installers - so these are likely the culprits on using diskpart - essentially MSI installers check if there's enough disk space before installing items

Zopieux · 2026-04-22T20:31:04 1776889864

Thanks for that. Did you notice that the unsloth/unsloth docker image is 12GB? Does it embed CUDA libraries or some default models that justifies the heavy footprint?

danielhanchen · 2026-04-30T05:12:17 1777525937

Hey so sorry didn't reply sooner - yes the docker used to be I think 4-8GB ish since CUDA sadly itself is 4GB I think, and PyTorch takes the rest. So unfortunately the Unsloth Docker image has ballooned due to this. We tried reducing it as much as possible, but it's hard :( https://hub.docker.com/r/vllm/vllm-openai/tags for eg is around 11GB ish, ad we're 13.6GB ish.

We'll try our best to compress it more, but it's tough

WanderPanda · 2026-04-22T18:23:52 1776882232

I applaud that you recently started providing the KL divergence plots that really help understand how different quantizations compare. But how well does this correlate with closed loop performance? How difficult/expensive would it be to run the quantizations on e.g. some agentic coding benchmarks?

danielhanchen · 2026-04-30T05:19:08 1777526348

Hey! Sorry for not replying sooner - yes we'll keep publishing more KLD - sadly some are saying we are "optimizing" for KLD now since we posted so many haha - but the whole purpose of quantization is to match the BF16 logits as much as possible whilst reducing disk space (ie reduce KLD).

In general so this is funny and a quirk of quantization - sometimes 8bit, 4bit models do BETTER on downstream benchmarks (SWE Bench for eg), since sometimes rounding can actually somehow act as a "regularization" method (this is just my hunch).

So KLD isn't that expensive, since we leverage the trick of causal attention - since causal attention is lower triangular, we can do 1 forward pass on the enter text (say 2048 tokens), and you attain logits for the prediction for every token's position - so this is O(N^2).

However coding benchmarking require actual inference, and cannot use the causal attention trick, and it's best to run them 10 times since temperature = 1.0 is not deterministic - and take an average. We plan to maybe do something like https://marginlab.ai/trackers/claude-code/, which takes a random sample and does it over time.

cyanydeez · 2026-04-22T18:10:33 1776881433

Is unsloth working on managing remote servers, like how vscode integrates with a remote server via ssh?

danielhanchen · 2026-04-30T05:21:04 1777526464

Hey sorry on the delay - we just added API support, so you can access a remote server - it includes optional python, tool call, bash and web search support if you enable them.

For SSH - we haven't yet done that - for now we have a SHA256 encryption approach, but it's not SSH yet. HTTPS will also sadly have to be the end user's setup process as well - we plan to make it better soon!

kristjansson · 2026-04-22T18:18:34 1776881914

Lmstudio Link is GREAT for that right now

danielhanchen · 2026-04-30T05:21:14 1777526474

Oh yes LM Link is cool!

jbellis · 2026-04-22T17:58:45 1776880725

what are you using for web search?

danielhanchen · 2026-04-30T05:21:48 1777526508

We use Duck Duck Go - sorry on the delayed response as well

wuschel · 2026-04-22T16:43:25 1776876205

Great project! Thank you for that!

danielhanchen · 2026-04-30T05:21:28 1777526488

Thank you and appreciate it! Sorry on the delayed reply as well

Aurornis · 2026-04-22T16:19:22 1776874762

> Say you have a GPU with 20GB of VRAM. You're probably going to be able to run all the 3-bit quantizations with no problem, but which one do you choose? Unsloth offers[1] four of them: UD-IQ3_XXS, Q3_K_S, Q3_K_M, UD-Q3_K_XL

There are actually two problems with this:

First, the 3-bit quants are where the quality loss really becomes obvious. You can get it to run, but you’re not getting the quality you expected. The errors compound over longer sessions.

Second, you need room for context. If you have become familiar with the long 200K contexts you get with SOTA models, you will not be happy with the minimal context you can fit into a card with 16-20GB of RAM.

The challenge for newbies is learning to identify the difference between being able to get a model to run, and being able to run it with useful quality and context.

zargon · 2026-04-22T17:58:52 1776880732

Qwen3.5 series is a little bit of an exception to the general rule here. It is incredibly kv cache size efficient. I think the max context (262k) fits in 3GB at q8 iirc. I prefer to keep the cache at full precision though.

zargon · 2026-04-22T20:47:49 1776890869

I just tested it and have to make a correction. With llama.cpp, 262144 tokens context (Q8 cache) used 8.7 GB memory with Qwen3.6 27B. Still very impressive.

magicalhippo · 2026-04-23T01:09:34 1776906574

The MoE variants are more cache efficient. Here from Qwen3.6 35B A3B MoE with 256k (262144) context at full F16 (so no cache quality loss):

  llama_kv_cache: size = 5120.00 MiB (262144 cells,  10 layers,  4/1 seqs), K (f16): 2560.00 MiB, V (f16): 2560.00 MiB

The MXFP4-quantized variant from Unsloth just fits my 5090 with 32GB VRAM at 256k context.

Meanwhile here's for Qwen 3.6 27B:

  llama_kv_cache: size = 3072.00 MiB ( 49152 cells,  16 layers,  4/1 seqs), K (f16): 1536.00 MiB, V (f16): 1536.00 MiB

So 16 tokens per MiB for the 27B model vs about 51 tokens per MiB for the 35B MoE model.

I went for the Q5 UD variant for 27B so could just fit 48k context, though it seems if I went for the Q4 UD variant I could get 64k context.

That said I haven't tried the Qwen3.6 35B MoE to figure out if it can effectively use the full 256k context, that varies from model to model depending on the model training.

smallerize · 2026-04-22T18:06:22 1776881182

I found the KLD benchmark image at the bottom of https://unsloth.ai/docs/models/qwen3.6 to be very helpful when choosing a quant.

ryandrake · 2026-04-22T16:46:10 1776876370

Yea, I'm also kind of jealous of Apple folks with their unified RAM. On a traditional homelab setup with gobs of system RAM and a GPU with relatively little VRAM, all that system RAM sits there useless for running LLMs.

zozbot234 · 2026-04-22T16:49:28 1776876568

That "traditional" setup is the recommended setup for running large MoE models, leaving shared routing layers on the GPU to the extent feasible. You can even go larger-than-system-RAM via mmap, though at a non-trivial cost in throughput.

khimaros · 2026-04-22T17:40:31 1776879631

Strix Halo is another option

jmspring · 2026-04-22T21:26:13 1776893173

qwen3.5 27b w/ 4bit quant works reasonably on a 3090.

mudkipdev · 2026-04-22T21:56:50 1776895010

HuggingFace has a nice UI where you can save your specs to your account and it will display a checkmark/red X next to every unsloth quantization to estimate if it will fit.

dannyw · 2026-04-23T00:35:50 1776904550

Evaluating different quant levels for your use case is a problem you can pretty reliably throw at a coding agent and leave overnight though. At least, it should give you a much smaller shortlist.

regularfry · 2026-04-22T22:02:48 1776895368

To add more complexity to the picture, you can run MoE models at a higher quant than you might think, because CPU expert offload is less impactful than full layer offload for dense models.