I get ~5 tokens/s on an M4 with 32G of RAM, using: llama-server \ -hf unsloth/Qw...

danielhanchen · 2026-04-22T15:49:29 1776872969

We also made some dynamic MLX ones in case they help. They might be faster on Macs, though llama-server is definitely improving at a fast pace.

https://huggingface.co/unsloth/Qwen3.6-27B-UD-MLX-4bit

DarmokJalad1701 · 2026-04-22T17:51:34 1776880294

What exactly does the .sh file install? How does it compare to running the same model in, say, omlx?

danielhanchen · 2026-04-30T06:13:50 1777529630

Sorry for the delay. It installs https://github.com/Blaizzy/mlx-vlm and other components and sets up the commands. You don't need to use it, but we thought it might be easier for folks.

dunb · 2026-04-22T15:46:39 1776872799

Why use --fit on on an M4? My understanding was that given the unified memory, you should push all layers to the GPU with --n-gpu-layers all. Setting --flash-attn on and --no-mmap may also get you better results.

halJordan · 2026-04-23T23:38:01 1776987481

Meaningless question: --fit will put everything on the GPU if it fits. Flash attention is on by default. --no-mmap is not an inference tradeoff, and if you do turn mmap off you need to turn on direct I/O via -dio.

What he should actually do is enable speculative decoding

fuomag9 · 2026-04-22T22:52:43 1776898363

I can confirm: with the GGUF version at Q4, 35B-A3B starts going into thinking loops at basically 60k.

kpw94 · 2026-04-22T16:48:51 1776876531

When you say tok/s here are you describing the prefill (prompt eval) token/s or the output generation tok/s?

(Btw, I believe the "--jinja" flag has been on by default since sometime in late 2025, so it's not needed anymore.)

benob · 2026-04-22T19:19:29 1776885569

Here is llama-bench on the same M4:

  | model                    |       size |     params | backend    | threads |            test |                  t/s |
  | ------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
  | qwen35 27B Q4_K_M        |  15.65 GiB |    26.90 B | BLAS,MTL   |       4 |           pp512 |         61.31 ± 0.79 |
  | qwen35 27B Q4_K_M        |  15.65 GiB |    26.90 B | BLAS,MTL   |       4 |           tg128 |          5.52 ± 0.08 |
  | qwen35moe 35B.A3B Q3_K_M |  15.45 GiB |    34.66 B | BLAS,MTL   |       4 |           pp512 |        385.54 ± 2.70 |
  | qwen35moe 35B.A3B Q3_K_M |  15.45 GiB |    34.66 B | BLAS,MTL   |       4 |           tg128 |         26.75 ± 0.02 |

So ~60 t/s for prefill and ~5 t/s for output on 27B, and roughly 5-6x those numbers on 35B-A3B.

zargon · 2026-04-22T17:24:19 1776878659

If someone doesn't specifically say prefill then they always mean decode speed. I have never seen an exception. Most people just ignore prefill.

kpw94 · 2026-04-22T17:40:01 1776879601

But isn't the prefill speed the bottleneck in some systems*?

Sure, it's an order of magnitude faster (10x on Apple Metal?), but there's also an order of magnitude more tokens to process, especially for tasks involving summarization of some sort.

But point taken that the parent numbers are probably decode

* Specifically, Mac Metal, which is what the parent numbers are about
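
To put rough numbers on the prefill-versus-decode point, here is a minimal sketch using the llama-bench figures posted above. The 8000-token prompt and 500-token reply are made-up illustrative values, and the calculation assumes throughput stays constant at the pp512/tg128 rates, which is optimistic for long contexts.

```python
# Throughput figures copied from the llama-bench table above (tokens per second).
BENCH = {
    "qwen35 27B Q4_K_M": {"pp512": 61.31, "tg128": 5.52},
    "qwen35moe 35B.A3B Q3_K_M": {"pp512": 385.54, "tg128": 26.75},
}


def request_seconds(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Return (prefill_seconds, decode_seconds), assuming constant throughput."""
    return prompt_tokens / prefill_tps, output_tokens / decode_tps


def summarize(prompt_tokens=8000, output_tokens=500):
    """Estimate per-phase time for each benchmarked model.

    The default prompt and output sizes are hypothetical, chosen to resemble
    a summarization-style request (long input, short output).
    """
    rows = []
    for model, tps in BENCH.items():
        prefill_s, decode_s = request_seconds(
            prompt_tokens, output_tokens, tps["pp512"], tps["tg128"]
        )
        rows.append((model, prefill_s, decode_s))
    return rows


if __name__ == "__main__":
    for model, prefill_s, decode_s in summarize():
        print(f"{model}: prefill {prefill_s:.0f}s, decode {decode_s:.0f}s")
```

Under these assumptions the 27B model spends more wall-clock time reading the prompt than writing the answer, even though its prefill rate is about 11x its decode rate.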

zargon · 2026-04-22T18:10:51 1776881451

Yes, definitely it's the bottleneck for most use cases besides "chatting". It's the reason I have never bought a Mac for LLM purposes.

It's frustrating when trying to find benchmarks because almost everyone gives decode speed without mentioning prefill speed.

mercutio2 · 2026-04-22T23:37:06 1776901026

oMLX makes prefill effectively instantaneous on a Mac.

Storing an LRU KV Cache of all your conversations both in memory, and on (plenty fast enough) SSD, especially including the fixed agent context every conversation starts with, means we go from "painfully slow" to "faster than using Claude" most of the time. It's kind of shocking this much perf was lying on the ground waiting to be picked up.

Open models are still dumber than leading closed models, especially for editing existing code. But I use it as essentially free "analyze this code, look for problem <x|y|z>" which Claude is happy to do for an enormous amount of consumed tokens.

But speed is no longer a problem. It's pretty awesome over here in unified memory Mac land :)
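
For readers unfamiliar with the idea, the caching described above can be sketched roughly as follows. This is not oMLX's implementation: the class name `PrefixKVCache`, its methods, and the opaque `kv_state` values are all hypothetical, and the sketch is memory-only (no SSD tier). It just shows how an LRU cache keyed by token prefix lets a new request skip prefill for the part of the prompt it shares with an earlier one.

```python
from collections import OrderedDict


class PrefixKVCache:
    """Toy LRU cache mapping token prefixes to an opaque KV state (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()  # tuple(tokens) -> kv_state, oldest first

    def put(self, tokens, kv_state):
        """Store the KV state for a token sequence, evicting the least recently used."""
        key = tuple(tokens)
        self._entries[key] = kv_state
        self._entries.move_to_end(key)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)

    def longest_prefix(self, tokens):
        """Return (matched_length, kv_state) for the longest cached prefix of tokens."""
        tokens = tuple(tokens)
        best_key = None
        for key in self._entries:
            if len(key) <= len(tokens) and tokens[: len(key)] == key:
                if best_key is None or len(key) > len(best_key):
                    best_key = key
        if best_key is None:
            return 0, None
        self._entries.move_to_end(best_key)  # mark as recently used
        return len(best_key), self._entries[best_key]


if __name__ == "__main__":
    cache = PrefixKVCache(capacity=2)
    system_prompt = [1, 2, 3]  # stand-in for a fixed agent context
    cache.put(system_prompt, "kv-for-system-prompt")
    matched, _ = cache.longest_prefix(system_prompt + [4, 5])
    print(f"tokens still needing prefill: {5 - matched}")
```

The linear scan over entries is only for clarity; a real cache would likely use a trie or block hashes to find the shared prefix.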

cyanydeez · 2026-04-22T18:16:40 1776881800

Using opencode and Qwen-Coder-Next I get it reliably up to about 85k before it takes too long to respond.

I tried the other qwen models and the reasoning stuff seems to do more harm than good.

wuschel · 2026-04-22T17:30:20 1776879020

How is the quality of model answers to your queries? Are they stable over time?

I am wondering how to measure that anyway.