At what context sizes? I've just run the same prompt and query on my RTX 3080 with openwebui as the frontend.

When I set the context size to 2048 (openwebui's default), inference is almost twice as fast as when I set it to 4096. I can't set the context size any higher because my GPU only has 12GB of VRAM and ollama crashes with larger context sizes.

Still, I find that thoroughly odd. With the larger context size (4096), GPU usage is only 50% according to nvtop. I have no idea why.
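
In case anyone wants to reproduce the comparison without the frontend in the way, here is a rough sketch of how I'd time it against ollama directly. Assumptions on my part: ollama is listening on its default local port, the model name and prompt are placeholders, and the helper names are made up for this example. It sets the context size through the `num_ctx` option and works out tokens per second from the `eval_count` and `eval_duration` fields in the response.

```python
import json
import urllib.request

# Assumption: ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model, prompt, num_ctx):
    """Build a non-streaming generate request with an explicit context size."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


def tokens_per_second(response):
    """Generation speed: eval_count tokens over eval_duration nanoseconds."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def run_once(model, prompt, num_ctx):
    """Send one request to ollama and return the parsed JSON response."""
    data = json.dumps(build_payload(model, prompt, num_ctx)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


def compare(model, prompt, sizes=(2048, 4096)):
    """Print generation speed for each context size (needs a running server)."""
    for num_ctx in sizes:
        result = run_once(model, prompt, num_ctx)
        print(num_ctx, round(tokens_per_second(result), 1), "tokens/s")
```

Calling something like `compare("llama3", "Why is the sky blue?")` (model name is just a placeholder) should print one line per context size. If I'm not mistaken, `ollama ps` also reports how a loaded model is split between CPU and GPU, which might be worth checking alongside nvtop.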