Yeah sure, but if you do that you are heavily dropping the token/s for a single user. The only way to recover from that is continuous batching. This could still be interesting if the KV caches of all users fit in SRAM though.
> but if you do that you are heavily dropping the token/s for a single user.
I don’t follow what you are saying or what “that” refers to specifically. Assuming it refers to using HBM rather than just SRAM: this is not optional on a GPU, because SRAM is many orders of magnitude too small. Data is constantly flowing between HBM and SRAM by design, and to get data in or out of the GPU you have to go through HBM first; you can’t skip that.
And while SRAM is quite massive on a Cerebras system, it is still too small to hold very large models.
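A back-of-envelope calculation illustrates the scale gap. The model shape below is an assumption for illustration (a Llama-3-70B-like configuration with grouped-query attention); the SRAM figure is the rough on-chip total of a current datacenter GPU, not anything stated in the thread:

```python
# KV cache size per user, assumed Llama-3-70B-like dimensions (hypothetical example)
layers, kv_heads, head_dim = 80, 8, 128
bytes_per_elem = 2                      # fp16
seq_len = 8192                          # assumed context length per user

# K and V each store (kv_heads * head_dim) values per layer per token
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
kv_per_user = kv_per_token * seq_len

print(f"KV per token: {kv_per_token / 2**10:.0f} KiB")   # 320 KiB
print(f"KV per user:  {kv_per_user / 2**30:.1f} GiB")    # 2.5 GiB
```

So a single user's 8k-token KV cache is ~2.5 GiB, while a GPU's total on-chip SRAM is on the order of tens of MiB: roughly two orders of magnitude short even for one user, before weights or batching enter the picture.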