Then load the image with /image image.png inside the chat, and chat away!
EDIT: -ngl -1 is not needed anymore for Metal backends (CUDA still yes) (llama.cpp will auto offload to the GPU by default!). -1 means all GPU layers offloaded to the GPU.
If you are on a Mac, give https://recurse.chat/ a try. As simple as download the model and start chatting. Just added the new multimodal support in LLaMA.cpp.
Give https://docs.openwebui.com/ a look, you'll be able to access it by using your desktops IP while on your laptop (providing you're on the same network).
You'll have to compile llama.cpp from source, and you should get a llama-mtmd-cli program.
I made some quants with vision support - literally run:
./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl -1
./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-12b-it-GGUF:Q4_K_XL -ngl -1
./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-27b-it-GGUF:Q4_K_XL -ngl -1
./llama.cpp/llama-mtmd-cli -hf unsloth/unsloth/Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_XL -ngl -1
Then load the image with /image image.png inside the chat, and chat away!
EDIT: -ngl -1 is not needed anymore for Metal backends (CUDA still yes) (llama.cpp will auto offload to the GPU by default!). -1 means all GPU layers offloaded to the GPU.