I have been using Qwen3.5-35B-A3B a lot in local testing, and it is by far the most capable model that could fit on my machine.
I think quantization technology has really upped its game around these models, and two quants in particular blew me away.
First I tried Mudler APEX-I-Quality, then later Byteshape Q3_K_S-3.40bpw.
Both made claims that seemed too good to be true, but I couldn't find any traces of lobotomization during long agent coding loops.
With the Byteshape quant I am up to 40+ t/s, which is a speed that makes agents much more pleasant.
On an RTX 3060 12GB with 32GB of system RAM, I went from slamming all my available memory to having around 14GB to spare.
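For a rough sense of why a 3.40bpw quant leaves that much headroom, here is a back-of-the-envelope sketch. It assumes the model has about 35 billion weights in total (my reading of the "35B" in the name) and it ignores the KV cache, compute buffers and file metadata, so treat it as an estimate and not a measurement. The function name is just something I made up for illustration.

```python
def estimate_weight_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in decimal gigabytes.

    total_params and bits_per_weight are assumptions supplied by the caller;
    the result does not include KV cache, buffers or file metadata.
    """
    total_bits = total_params * bits_per_weight
    total_bytes = total_bits / 8
    return total_bytes / 1e9


# Assumed: ~35e9 weights, 3.40 bits per weight (from the quant's name).
size_gb = estimate_weight_gb(35e9, 3.40)
print(f"{size_gb:.1f} GB")  # prints "14.9 GB"
```

That is more than the 12GB on the card, so under these assumptions part of the model still has to sit in system RAM, but it is a much smaller footprint than a higher-bpw quant of the same model.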
Unfortunately, llama.cpp quantization technology has been stagnant for two years. The main quantization developer left or was kicked out of llama.cpp due to an attribution dispute. He created his own fork, ik_llama.cpp, where he has made multiple new and better quants.
Unsloth and Byteshape are just using and highlighting features that have been available the whole time. I am very invested in figuring out a solution to this dispute, or some way to get the new quants upstreamed.
Now that I have tried it out on a few tasks, Qwen3.6 is a huge jump in capability.
It can make improvements to a project that Qwen3.5 always struggled with.
I would say Byteshape is smaller and faster, and I can't really notice a quality difference. But I haven't used it as much, since I only started using it a few days ago.