
Maybe we shouldn't be running these models on laptops with their thermally constrained form factor, and we shouldn't expect quick inference on a par with a large cloud-based platform either, at least not for near-SOTA model quality. It's still worth it to avoid becoming massively reliant on centralized services.



I have a 5070 12 GB laptop GPU and can hit 72 tokens per second in the first couple thousand tokens before dropping to mid-high 50s after about 15k context.

This setup is extremely optimized down to the last flag. Changing any param above the temp flag craters performance.

I don't have enough system RAM to properly handle the large context windows so I don't use local models.

  # 1,257 tokens 17s 72.18 t/s

  $env:CUDA_DEVICE_SCHEDULE = "SPIN"
  cd D:\src\llama.cpp\
  .\build\bin\Release\llama-server.exe `
    --port 8080 `
    --host 127.0.0.1 `
    -m "D:\LLM\Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf" `
    -fitt 2048 `
    -c 98304 `
    -n 32768 `
    -fa on `
    -np 1 `
    --kv-unified `
    -ctk q8_0 `
    -ctv q8_0 `
    -ctkd q8_0 `
    -ctvd q8_0 `
    -ctxcp 64 `
    --mlock `
    --no-warmup `
    --spec-type draft-mtp `
    --spec-draft-n-max 2 `
    --spec-draft-p-min 0.1 `
    --chat-template-kwargs '{\"preserve_thinking\": true}' `
    --temp 0.6 `
    --top-p 0.95 `
    --top-k 20 `
    --min-p 0.0 `
    --presence-penalty 0.0 `
    --repeat-penalty 1.0
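
For anyone wondering why the -c and -ctk/-ctv choices matter for memory, here is a rough back-of-the-envelope sketch in Python. The layer count, KV-head count and head dimension below are hypothetical placeholders, not this model's real values (read those from the GGUF metadata), and the formula assumes plain attention layers that each keep a full K and V cache. Only the context size (98304) and the q8_0 cache type come from the command above.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bits_per_elem):
    """Approximate KV-cache size in bytes for standard attention layers."""
    # Each token stores one K and one V vector per layer per KV head.
    elems_per_token = 2 * n_layers * n_kv_heads * head_dim
    return n_ctx * elems_per_token * bits_per_elem / 8


BITS_F16 = 16.0
# q8_0 packs 32 int8 values plus one 16-bit scale per block: 34 bytes / 32 values.
BITS_Q8_0 = 8.5

# Hypothetical architecture values, for illustration only.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 40, 4, 128
N_CTX = 98304  # the -c value from the command above

for name, bits in (("f16", BITS_F16), ("q8_0", BITS_Q8_0)):
    gib = kv_cache_bytes(N_CTX, N_LAYERS, N_KV_HEADS, HEAD_DIM, bits) / 2**30
    print(f"{name} KV cache at {N_CTX} tokens: {gib:.2f} GiB")
```

Whatever the real architecture numbers are, the ratio holds: a q8_0 cache takes 8.5/16 of the space of an f16 cache at the same context length.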

That’s useless without describing WHY you chose those flags, and how you did the optimisation…

The switches are all in the -h of llama.cpp (although the maintainers have a tendency to use the word in its definition). The actual values are essentially just what Alibaba recommends, so you just need their model card. I would not call it highly optimized; "appropriately tuned" is more accurate.

I found every possible flag and its description, including CUDA-related environment variables, and went back and iterated with Claude Opus 4.8 High until every single flag above the temp one mattered.

I get over 100 tok/s sustained on my M4 Max and M5 Max, in MacBook Pros. LM Studio + MLX.

Same experience on M4 Max, but the quality of Qwen still leaves so much to be desired after getting used to virtually unlimited tokens at work. Many people on this (and similar) threads seem to believe local models will inevitably improve, and I want to believe this too, but I don’t see it ever happening without them growing in size.

With Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf?

Also, funny lumping the M4 "and" the M5; I find their performance differs by 15% to 45%, depending.

And for a good deal of work, an M3 Studio Ultra outpaces the M4 and ties the M5 on one job at a time, and outpaces both when running multiple jobs at once.


> With Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf?

Ahh no I'm using the MLX version, it's about 5-10% faster than GGUFs in my experience.


That's a quant 4 which the thread OP specifically called out as rubbish.

The Q4_K_XL bit for those not in the know.


I typically find myself using a context of between 150k and 500k with GPT models, so local models are simply not enough and I stopped using them.

That's way higher than their optimal ceiling (and absolutely suboptimal from a token cost point of view). Why are you doing that?

You're 100% right, and it's even more severe than that: I daily drive on xhigh. I really try to avoid it, but when reconciling APIs across two large codebases you really start pressing north of 200k. I find myself topping out at 800k sometimes, and that's with careful context management. I actually had to drop to GPT 5.4 for 1M context in my subscription because GPT 5.5 tops out at 272k. Hitting 800k context is better than repeatedly hitting, let's say, 200k out of 272k with multiple rounds of compaction. I run Can's snapcompact, and while it's better than normal compaction it still lobotomizes the model more than running with a very high context window.

large contexts degrade the performance - attention doesn't work well for large windows like that and cloud models are kind of hacking it

local models do involve some context engineering to get it okay, but it's not that rough


Anyone calling Qwen3.6-35B-A3B-Q4_K_XL “rubbish” has no idea what they are talking about.

I'd agree that the quality degrades a lot between Q8 and Q4, borderline unusable as they start to fail with tool calling syntax even. Personally I'd say Q8 is as low as you want to go.

He's probably calling me out for this comment https://news.ycombinator.com/item?id=48557579

q4 isn't rubbish, but it's a compromise for a good value, q6 is essentially a no-compromise quantization and it's what i recommend for MoEs in my experience for agentic workflows
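
To put rough numbers on the Q4 vs Q6 vs Q8 trade-off, here is a small Python sketch that estimates weight storage from bits per weight. The bits-per-weight figures follow from the GGUF block layouts (144 bytes per 256 weights for Q4_K, 210 bytes per 256 for Q6_K, 34 bytes per 32 for Q8_0). The 35e9 parameter count is an assumption read off the "35B" in the model name, and real files such as the UD/XL variants mix tensor types, so actual sizes will differ.

```python
# Bits per weight implied by each block layout.
BITS_PER_WEIGHT = {
    "Q4_K": 4.5,     # 144 bytes per 256-weight super-block
    "Q6_K": 6.5625,  # 210 bytes per 256-weight super-block
    "Q8_0": 8.5,     # 34 bytes per 32-weight block
}


def weight_bytes(n_weights, bits_per_weight):
    """Approximate storage in bytes if every weight used one quant type."""
    return n_weights * bits_per_weight / 8


# Assumption: parameter count taken from the "35B" in the model name.
N_PARAMS = 35e9

for quant, bpw in BITS_PER_WEIGHT.items():
    gib = weight_bytes(N_PARAMS, bpw) / 2**30
    print(f"{quant}: ~{gib:.1f} GiB if all tensors used this type")
```

This only covers the weights; the KV cache and compute buffers come on top, which is why a quant that fits on paper can still spill out of VRAM.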

Can you comment on the quality and accuracy of it? People have managed to run Gemma 26b without GPU on old CPUs but I don't think quality is anywhere close to what Gemma 12b offers.

> It's still worth it to avoid becoming massively reliant on centralized services.

This isn't really good enough. Many of us need to get things done in a pinch, and if our employers are already getting used to the idea of paying for enterprise subscriptions to cloud LLMs then the local option needs to be good.


For me I use only cloud for work. But I'd never trust any of my personal data to it.

I have three laptops and a desktop. The desktop has 128GB of RAM.

I bought the memory at the end of last year, and I was thinking, maybe this is excessive. No game will use that much memory, in a decade or more.

Now I realize it was one of the best purchases ever: I run qwen3-coder-next on it for just the cost of the electricity, while the coding and agents and whatever else is done on a laptop. Yeah, it is slower; I don't care. Infinite tokens is better than a few.

The cloud is another computer, but in this case it is mine =)



