On llama server, the Q4_K_M is giving about 91k context on 24GB, which calculate...

sleepyeldrazi · 2026-04-23T09:33:01

I have been getting good results with IQ4_NL and TurboQuant at 4 bits on 24GB (3090). That setup easily fits a 256k context, but it starts slowing down noticeably past 80-100k tokens. Quality in my testing is also still good:

- Coding task test: https://github.com/sleepyeldrazi/llm_programming_tests/
- Design task test: https://github.com/sleepyeldrazi/llm-design-showcase

The coding tests were run against minimax-m2.7 and glm-5, and the design tests against other small models.
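For a rough sense of why a 4-bit setup stretches the context so much on a 24GB card, here is a back-of-the-envelope KV cache estimate. It is a sketch under assumptions: it treats "TurboQuant at 4 bits" as quantizing the KV cache to about 4 bits per element, the model shape (48 layers, 8 KV heads, head dim 128) and the 8 GiB leftover budget are made up for illustration, and it ignores per-block scale overhead, compute buffers, and whatever llama-server actually allocates.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bits_per_element: float) -> float:
    """Approximate KV cache size in bytes.

    The factor 2 is one K and one V tensor per layer. Quantization scale
    overhead and runtime padding are ignored (assumption).
    """
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * bits_per_element / 8


def max_context(budget_bytes: float, n_layers: int, n_kv_heads: int,
                head_dim: int, bits_per_element: float) -> int:
    """Largest context length whose KV cache fits in budget_bytes."""
    per_token = kv_cache_bytes(n_layers, n_kv_heads, head_dim, 1, bits_per_element)
    return int(budget_bytes // per_token)


GiB = 1024 ** 3

# Hypothetical model shape, not taken from the post.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

for bits in (16, 8, 4):
    size = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, 262_144, bits)
    print(f"{bits:>2}-bit KV cache at 256k ctx: {size / GiB:.1f} GiB")

# Hypothetical 8 GiB of VRAM left for the KV cache after loading weights.
for bits in (16, 4):
    ctx = max_context(8 * GiB, LAYERS, KV_HEADS, HEAD_DIM, bits)
    print(f"{bits:>2}-bit: max context in 8 GiB = {ctx:,} tokens")
```

With these made-up numbers, a 4-bit cache is a quarter the size of an f16 one, so the same leftover VRAM holds roughly four times as many tokens. Real formats carry some extra bits for scales, so actual gains are a bit smaller.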