Aren't these models consistently quite large and hard to run locally? It's possible that future Ollama releases will let you manage VRAM dynamically, so that these models can run with acceleration even on modest GPU hardware. For example, a runtime could load the layers for a single 'expert' into VRAM on demand and opportunistically batch computations that happen to rely on the same 'expert' parameters, essentially doing by hand what mmap does for you in CPU-only inference. Even so, these 'tricks' will come at a non-trivial cost in performance.
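To make the idea concrete, here is a toy sketch of the two 'tricks' mentioned above: keeping only a few experts resident in a fixed budget, and grouping work by expert so each one is fetched once. This is not how Ollama works today, and nothing here is a real Ollama API; `ExpertCache`, `run_token_by_token`, `run_grouped_by_expert`, and the routing list are all hypothetical names made up for illustration, and the "load" is just a counter standing in for a slow host-to-GPU copy.

```python
from collections import OrderedDict, defaultdict


class ExpertCache:
    """Toy LRU cache standing in for a fixed VRAM budget (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity       # how many experts fit in "VRAM"
        self.resident = OrderedDict()  # expert_id -> weights, oldest first
        self.loads = 0                 # count of simulated host-to-GPU copies

    def get(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)  # mark as recently used
            return self.resident[expert_id]
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)     # evict least recently used
        self.loads += 1                           # simulated slow transfer
        self.resident[expert_id] = f"weights-{expert_id}"
        return self.resident[expert_id]


def run_token_by_token(routing, cache):
    """Naive order: fetch whichever expert each token needs, as tokens arrive."""
    for expert_id in routing:
        cache.get(expert_id)
    return cache.loads


def run_grouped_by_expert(routing, cache):
    """Batch tokens that share an expert so each expert is fetched once."""
    groups = defaultdict(list)
    for token_index, expert_id in enumerate(routing):
        groups[expert_id].append(token_index)
    for expert_id in groups:
        cache.get(expert_id)  # one fetch covers every token in the group
    return cache.loads


# Hypothetical routing decisions: token i is sent to expert routing[i].
routing = [0, 1, 2, 0, 1, 2, 0, 1, 2]
naive_loads = run_token_by_token(routing, ExpertCache(capacity=2))
grouped_loads = run_grouped_by_expert(routing, ExpertCache(capacity=2))
print(naive_loads, grouped_loads)  # prints: 9 3
```

With room for only two of the three experts, the naive order thrashes the cache and reloads an expert for every token, while grouping needs one load per expert. A real model routes each token to several experts at every layer, so the savings would be smaller and the transfers would still cost time, which is the performance penalty described above.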