Maybe we shouldn't be running these models on laptops with their thermally constrained form factor, and we shouldn't expect quick inference on a par with a large cloud-based platform either, at least not for near-SOTA model quality. It's still worth it to avoid becoming massively reliant on centralized services.
I have a 5070 12 GB laptop GPU and can hit 72 tokens per second in the first couple thousand tokens before dropping to mid-high 50s after about 15k context.
This setup is extremely optimized down to the last flag. Changing any param above the temp flag craters performance.
I don't have enough system RAM to properly handle the large context windows so I don't use local models.
The switches are all in the -h of llama.cpp (although the maintainers have a tendency to use the word in its definition). The actual values are essentially just what alibaba recommends. So you just need their model card. I would not call it highly optimized, more appropriately tuned.
I found every possible flag and its description including CUDA related environment variables and went back and iterated with Claude Opus 4.8 High until every single flag mattered above the temp one.
Same experience on M4 Max .. but quality of qwen still leaves so much to be desired after getting used to virtually unlimited tokens at work. Many people on this (and similar) thread seem to believe local models would inevitably improve, and I want to believe this too, but I don’t see this ever happening without growing in size
You're 100% right and its even severe than that: I daily drive on xhigh. I really try to avoid it, but when reconciling APIs across two large codebases you really start pressing north of 200k. I find myself topping out at 800k sometimes and that's with careful context management. I actually had to drop to GPT 5.4 for 1M context in my subscription because GPT 5.5 tops out at 272k. Hitting 800k context is better than repeatedly hitting let's say 200k out of 272k with multiple rounds of compaction. I run Can's snapcompact and while its better than normal compaction it still lobotomizes the model more than running with a very high context window.
I'd agree that the quality degrades a lot between Q8 and Q4, borderline unusable as they start to fail with tool calling syntax even. Personally I'd say Q8 is as low as you want to go.
q4 isn't rubbish, but it's a compromise for a good value, q6 is essentially a no-compromise quantization and it's what i recommend for MoEs in my experience for agentic workflows
Can you comment on the quality and accuracy of it? People have managed to run Gemma 26b without GPU on old CPUs but I don't think quality is anywhere close to what Gemma 12b offers.
> It's still worth it to avoid becoming massively reliant on centralized services.
This isn't really good enough. Many of us need to get things done in a pinch and if our employers are already getting used to the idea of paying for enterprise subscriptions to cloud llm's then the local option needs to be good
I have three laptops and a desktop. The desktop has 128GB of RAM.
I bought the memory at the end of the last year, and I was thinking, maybe this is excessive. No game will use that much memory, in a decade or more.
Now I realize it was one of the best purchases ever, I run qwen3-coder-next on it for just the cost of the electricity, while the coding and agents and whatever else is done in a laptop. Yeah, it is slower, I don't care. Infinite tokens is better than a few.
The cloud is another computer, but in this case it is mine =)