jmorgan · 2025-12-22T03:59:33 1766375973

The source is available here: https://github.com/ollama/ollama/tree/main/app

ekianjo · 2025-12-22T16:06:26 1766419586

Thanks, I stand corrected.

jmorgan · 2025-11-08T04:13:47 1762575227

The gpt-oss weights on Ollama are native mxfp4 (the same weights provided by OpenAI). No additional quantization is applied, so let me know if you're seeing any strange results with Ollama.

Most gpt-oss GGUF files online have parts of their weights quantized to q8_0, and we've seen folks get some strange results from these models. If you're importing these to Ollama to run, the output quality may decrease.

jmorgan · 2025-09-25T20:18:04 1758831484

We did consider building functionality into Ollama that would fetch search results and website contents using a headless browser or similar. However, we had a lot of worries about result quality and about IP blocking if Ollama produced crawler-like behavior. Having a hosted API felt like a fast path to get results into users' context windows, but we are still exploring the local option. Ideally you'd be able to stay fully local if you want to (even when using capabilities like search).

jmorgan · 2025-08-14T18:32:09 1755196329

Amazing work. This model feels really good at one-off tasks like summarization and autocomplete. I really love that you released a quantization-aware training version on launch day as well, making it even smaller!

canyon289 · 2025-08-14T18:36:46 1755196606

Thank you Jeffrey, and we're thrilled that you folks at Ollama partner with us and the open model ecosystem.

I personally was so excited to run ollama pull gemma3:270b on my personal laptop just a couple of hours ago to get this model on my devices as well!

blitzar · 2025-08-14T19:09:19 1755198559

> gemma3:270b

I think you mean gemma3:270m - It's Dos Comas not Tres Comas

freedomben · 2025-08-14T19:26:54 1755199614

Maybe it's 270m after Hooli's SOTA compression algorithm gets ahold of it

canyon289 · 2025-08-14T21:37:44 1755207464

Ah yes, thank you. Even I still instinctively type B.

jmorgan · 2025-08-06T05:42:39 1754458959

It should open ollama.com/connect – sorry about that. Feel free to message me at jeff@ollama.com if you keep seeing issues.

jmorgan · 2025-08-05T19:25:10 1754421910

Sorry about this. Re-downloading Ollama should fix the error.

nodesocket · 2025-08-06T10:17:01 1754475421

Thanks for the reply and speedy patch, Jeffrey. Seems to be working now, except my 4060 Ti can't hang since it lacks enough VRAM.

jmorgan · 2025-06-11T03:50:08 1749613808

Working on adding tool calling support to Magistral in Ollama. It requires a tokenizer change and also uses a new tool calling format. Excited to see the results of combining thinking + tool calling!

jmorgan · 2025-02-16T22:38:05 1739745485

This is a great point. apt-get would definitely be a better install and upgrade experience (that's what I would want too). Tailscale does this amazingly well: https://tailscale.com/download/linux

The main issue for the maintainer team would be the work of hosting and maintaining all the package repos for apt, yum, etc., and making sure we handle the case where nvidia/amd drivers aren't installed (quite common on cloud VMs). Mostly a matter of time and putting in the work.

For now every release of Ollama includes a minimal archive with the ollama binary and required dynamic libraries: https://github.com/ollama/ollama/blob/main/docs/linux.md#man.... But we could definitely do better.

jmorgan · 2025-01-26T20:38:28 1737923908

Sorry this isn't more obvious. Ideally VRAM usage for the context window (the KV cache) would be dynamic, starting small and growing with token usage. Right now Ollama defaults to a context size of 2K, which can be overridden at runtime. Great examples of the dynamic approach are vLLM's PagedAttention implementation [1] and Microsoft's vAttention [2], which is CUDA-specific (and there are quite a few others).
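For anyone wondering what "overridden at runtime" can look like: a minimal sketch, assuming a local Ollama server on the default port and the `num_ctx` option of the `/api/generate` endpoint. The helper name `build_generate_request` and the 8192 value are just for illustration, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_generate_request(model, prompt, num_ctx, host="http://localhost:11434"):
    """Build (but do not send) a generate request with a custom context size.

    The host is the assumed default for a local Ollama server.
    """
    body = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # Per-request override of the context window size, in tokens.
        "options": {"num_ctx": num_ctx},
    }
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_generate_request("gemma3:270m", "Summarize: ...", num_ctx=8192)
# urllib.request.urlopen(req) would send it to a running server.
```

A larger `num_ctx` means a larger KV cache allocation up front, so it is worth raising only as far as the workload needs.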

1M tokens will definitely require a lot of KV cache memory. One way to reduce the memory footprint is KV cache quantization, which has recently been added behind a flag [3] and cuts the KV cache memory footprint to roughly a quarter if 4-bit quantization is used (OLLAMA_KV_CACHE_TYPE=q4_0 ollama serve).

[1] https://arxiv.org/pdf/2309.06180

[2] https://github.com/microsoft/vattention

[3] https://smcleod.net/2024/12/bringing-k/v-context-quantisatio...
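To put rough numbers on the KV cache sizes discussed above, here is a back-of-the-envelope sketch. The model dimensions (32 layers, 8 KV heads, head size 128) are a hypothetical example, not any specific model, and the per-block byte counts assume ggml's 32-element block layouts (q4_0: 18 bytes, q8_0: 34 bytes, f16: 64 bytes). Real usage will differ with the model and runtime overhead.

```python
# Bytes used per block of 32 cached elements (assumed ggml block layouts).
BLOCK_SIZE = 32
BYTES_PER_BLOCK = {"f16": 64, "q8_0": 34, "q4_0": 18}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, cache_type="f16"):
    """Estimate KV cache size: one K and one V vector per layer per token."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * BYTES_PER_BLOCK[cache_type] // BLOCK_SIZE


# Hypothetical model dimensions, for illustration only.
dims = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for n_ctx in (2048, 1_000_000):
    f16 = kv_cache_bytes(n_ctx, cache_type="f16", **dims)
    q4 = kv_cache_bytes(n_ctx, cache_type="q4_0", **dims)
    print(f"n_ctx={n_ctx}: f16={f16 / 2**30:.2f} GiB, "
          f"q4_0={q4 / 2**30:.2f} GiB, ratio={q4 / f16:.3f}")
```

Under these assumptions the q4_0 cache comes out slightly above a quarter of the f16 size, because each block also stores a scale factor.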

jmorgan · on Jan 12, 2025

Phi-4's architecture changed slightly from Phi-3.5 (it no longer uses a sliding window of 2,048 tokens [1]), causing a change in the hyperparameters (and ultimately an error at inference time for some published GGUF files on Hugging Face, since the same architecture name/identifier was re-used between the two models).

For the Phi-4 model uploaded to Ollama, the hyperparameters were set to avoid the error. The error should stop occurring in the next version of Ollama [2] for imported GGUF files as well.

In retrospect, a new architecture name should probably have been used entirely, instead of re-using "phi3".

[1] https://arxiv.org/html/2412.08905v1

[2] https://github.com/ollama/ollama/releases/tag/v0.5.5