
At 4-bit quantization it should already fit quite nicely.


Unfortunately not with a reasonable context length.


I've got 139k context with the UD-Q4_K_XL quant on a 4090, with q8_0 for both the K and V cache (ctk/ctv). Could probably squeeze a little more but that's enough for me for the moment.


Hey, buddy! Can I bum a command line arg list off ya?


The model uses Gated DeltaNet and Gated Attention so the memory usage of the KV cache is very low, even at BF16 precision.
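For a rough feel of why that matters, here is a back-of-the-envelope sketch. It assumes only the full-attention layers keep a per-token K/V cache, while the Gated DeltaNet layers hold a fixed-size state that does not grow with context (ignored here). The layer counts, head counts and head size are made-up illustrative numbers, not this model's actual config.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2.0):
    """Bytes used by the K and V tensors of the layers that keep a per-token cache."""
    # Each cached token stores one K and one V vector per KV head per layer.
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem
    return int(n_attn_layers * context_len * per_token_per_layer)


GIB = 1024 ** 3
CTX = 131072  # 128k tokens

# Hypothetical config: 48 layers, 2 KV heads, head_dim 256, BF16 (2 bytes per element).
# "hybrid" assumes only 1 layer in 4 is full attention; "dense" assumes all 48 are.
hybrid = kv_cache_bytes(n_attn_layers=12, n_kv_heads=2, head_dim=256, context_len=CTX)
dense = kv_cache_bytes(n_attn_layers=48, n_kv_heads=2, head_dim=256, context_len=CTX)

print(f"hybrid: {hybrid / GIB:.1f} GiB, dense: {dense / GIB:.1f} GiB")
```

With these assumed numbers the cache is a quarter of what an all-attention stack of the same shape would need, since it scales with the number of layers that actually cache per-token K/V.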


It really depends on what you think a reasonable context length is, but I can get 50k-60k on a 4090.



