It's still somewhat faster if you benchmark it. I assume the OS is doing good enough prefetching in the mmap case to mostly hide the loads from disk. So it's not just hiding the initial load of 30 GB from disk.
Obviously if you're swapping because you don't have enough memory to hold the model in RAM, the mmap version is going to be much faster: since the mapping is backed by the file, there's nothing to swap out to disk; the kernel can just discard a page and re-read it from the file if it's needed again later.
> So it's not just hiding the initial load of 30gb from disk.
The issue is typically that the initial load involves some sort of transformation (parsing, instantiating structures, etc.). If you can arrange for the data to be stored on disk in exactly the format you need it in memory, you can skip that entire transformation phase.
I don't know if that's what's been done with llama.cpp, though.