If you want to have an opinion on it, just install lmstudio and run the q8_0 ver...

esafak · 2025-08-06T19:22:47 1754508167

Thank you. To spare Mac readers time:

mlx 4bit: https://huggingface.co/lmstudio-community/Qwen3-4B-Thinking-...

mlx 5bit: https://huggingface.co/lmstudio-community/Qwen3-4B-Thinking-...

mlx 6bit: https://huggingface.co/lmstudio-community/Qwen3-4B-Thinking-...

mlx 8bit: https://huggingface.co/lmstudio-community/Qwen3-4B-Thinking-...

edit: corrected the 4b link

belter · 2025-08-06T21:06:48 1754514408

This comment saved 3 tons of CO2

ckcheng · 2025-08-06T19:34:11 1754508851

Did you mean mlx 4bit:

https://huggingface.co/lmstudio-community/Qwen3-4B-Thinking-...

magnat · 2025-08-06T19:35:08 1754508908

> if you run it at the full 262144 tokens of context youll need ~65gb of ram

What is the relationship between context size and RAM required? Isn't the size of RAM related only to number of parameters and quantization?

Gracana · 2025-08-06T20:30:18 1754512218

The context cache (or KV cache) is where intermediate results are stored. One for each output token. Its size depends on the model architecture and dimensions.

KV cache size = 2 * batch_size * context_len * num_key_value_heads * head_dim * num_layers * element_size. The "2" is for the two parts, key and value. Element size is the precision in bytes. This model uses grouped query attention, which reduces num_key_value_heads compared to a multi head attention (MHA) model.

With batch size 1 (for low-latency single-user inference), 32k context (recommended in the model card), fp16 precision:

2 * 1 * 32768 * 8 * 128 * 36 * 2 = 4.5GiB.

I think, anyway. It's hard to keep up with this stuff. :)

wkat4242 · 2025-08-07T03:42:36 1754538156

Yes but you can quantise the KV cache too just like you can the weights.

hnuser123456 · 2025-08-06T20:32:12 1754512332

A 24GB GPU can run a ~30b parameter model at 4bit quantization at about 8k-12k context length before every GB of VRAM is occupied.

iamnotagenius · 2025-08-07T06:50:13 1754549413

Not quite true. Depends on number of KV heads. GLM4 32b at IQ4 quant and Q8 context can run full context with only 20GiB VRAM.

DSingularity · 2025-08-06T19:50:58 1754509858

No. Your KV cache is kept in memory also.

aitchnyu · 2025-08-07T07:23:42 1754551422

Whats the space complexity for context size? And who is trying to drop it into linear complexity?

0x457 · 2025-08-06T22:03:05 1754517785

I mean...where do you think context is stored?

Aeroi · 2025-08-06T19:15:07 1754507707

how about on apple silicon for the iphone

jasonjmcghee · 2025-08-06T19:21:03 1754508063

https://joejoe1313.github.io/2025-05-06-chat-qwen3-ios.html