Keep in mind that executable code on Linux is mapped in from disk, either from an executable or from a shared library. So every application's performance on Linux is heavily dependent on the disk cache.
If you have no swap, anonymous pages (stacks, heaps) cannot be evicted to disk and the thrashing is forced onto the disk cache. So the hard lock-up occurs earlier.
If you want to delay the lock-up as much as possible, enable swap and set swappiness high.
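To see where a machine currently stands, here is a minimal Python sketch that reads `/proc/swaps` and `/proc/sys/vm/swappiness`. The function names `parse_swaps` and `describe_swap` are made up for illustration, and the parsing assumes the usual whitespace-separated column layout of `/proc/swaps` (sizes in KiB).

```python
from pathlib import Path


def parse_swaps(text):
    """Return (device, size_kib, used_kib) for each entry in /proc/swaps text."""
    entries = []
    for line in text.splitlines()[1:]:  # the first line is the column header
        fields = line.split()
        if len(fields) >= 4 and fields[2].isdigit() and fields[3].isdigit():
            entries.append((fields[0], int(fields[2]), int(fields[3])))
    return entries


def describe_swap(swaps_text, swappiness_text):
    """Summarise swap state (hypothetical helper, not a standard API)."""
    entries = parse_swaps(swaps_text)
    return {
        "enabled": bool(entries),
        "total_kib": sum(size for _, size, _ in entries),
        "used_kib": sum(used for _, _, used in entries),
        "swappiness": int(swappiness_text.strip()),
    }


if __name__ == "__main__":
    swaps = Path("/proc/swaps")
    swappiness = Path("/proc/sys/vm/swappiness")
    if swaps.exists() and swappiness.exists():
        print(describe_swap(swaps.read_text(), swappiness.read_text()))
```

Changing the value itself is done as root with `sysctl vm.swappiness=<value>`; what counts as "high" depends on the kernel version and the workload.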
Is this really from real-world experience? And does it hold only under certain conditions?
My experience is the opposite. Thrashing was a regular occurrence on my desktop, to the point where it never recovered and needed a manual hard reset.
Also, with 16GB RAM and 4GB swap, my running applications got moved to swap. Switching tabs in Firefox was slow because the pages had to come back from swap. I had set swappiness to 1 so that it shouldn't swap, but it always did.
Now without swap and using early_oom everything is fine. When I see in /proc/vmstat that there has been a kill, it is time to reboot.
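For anyone who wants to script that check: a small Python sketch that pulls the kill counter out of `/proc/vmstat`. It assumes the kernel exposes an `oom_kill` line there (older kernels may not), and the helper names are just for illustration.

```python
from pathlib import Path
from typing import Dict, Optional


def parse_vmstat(text: str) -> Dict[str, int]:
    """Parse the 'name value' lines of /proc/vmstat into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def oom_kill_count(text: str) -> Optional[int]:
    """Return the oom_kill counter, or None if this kernel does not report it."""
    return parse_vmstat(text).get("oom_kill")


if __name__ == "__main__":
    vmstat = Path("/proc/vmstat")
    if vmstat.exists():
        kills = oom_kill_count(vmstat.read_text())
        if kills is None:
            print("no oom_kill counter on this kernel")
        elif kills > 0:
            print(f"{kills} OOM kill(s) since boot")
        else:
            print("no OOM kills since boot")
```

The counter is cumulative since boot, so any non-zero value means at least one kill has happened in the current uptime.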
On my laptop though, my use case is different. It only has 2GB RAM, so I prefer swap over a hard kill. And I reboot it more than once a day if I am using it.
Yes, I learned it the hard way when debugging production outages. The VM sizes recommended for GitLab's Praefect were too small for our use case, and per our provisioning defaults none of the machines had swap. 150 MB of binaries in virtual memory, only 50 MB of disk cache left: that is when it clicked for me.
If you want a hard OOM kill, I don't know. I'm only talking about the I/O lock-up that happens in these situations.
Thank you. So RAM was quite minimal, just barely enough to run the applications, with almost none left for disk I/O. On my laptop that is the same situation. On my desktop, however, I have way more RAM than needed to run the applications. So I assume whether you want (or need) swap depends on the situation.
This is the correct answer that needs to be at the top. No swap doesn't mean the OOM killer magically kicks in earlier. It just means the anonymous memory has nowhere to go, so your executable pages get evicted and then you are really hosed.
Unfortunately no crash. This is the dog-slow case. Too slow for an SSH session to be able to start. But the machine might catch itself and get back on track without an OOM kill happening.
I went with enabling swap and monitoring for page pressure. At the end of the day, the disk cache for the application data is also highly performance critical.
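One way to do that kind of monitoring, sketched in Python: read the kernel's pressure stall information from `/proc/pressure/memory`. This assumes a kernel built with PSI enabled. The 10% threshold and the function names are placeholders, not what I actually run in production.

```python
from pathlib import Path


def parse_pressure(text):
    """Parse PSI text into {'some': {...}, 'full': {...}} with float values."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind = fields[0]  # 'some' or 'full'
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields[1:])
        }
    return result


def is_under_pressure(text, threshold=10.0):
    """True if all tasks were stalled on memory for >= threshold % of the last 10 s.

    The threshold is a hypothetical value picked for illustration.
    """
    full = parse_pressure(text).get("full", {})
    return full.get("avg10", 0.0) >= threshold


if __name__ == "__main__":
    psi = Path("/proc/pressure/memory")
    if psi.exists():
        text = psi.read_text()
        print(parse_pressure(text))
        print("under pressure:", is_under_pressure(text))
```

The `some` line counts time in which at least one task was stalled on memory, `full` counts time in which all non-idle tasks were stalled at once, which is the closer match to the lock-up described in this thread.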
How the lock-up looks in practice: RAM is mostly full of heaps and stacks, only a few MB are available for disk cache, and all processes fight each other to have their own code mapped into those remaining MB. Disk read I/O is fully saturated at this point.
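You can watch this ratio before it gets that far. A rough Python sketch over `/proc/meminfo`, assuming the `AnonPages`, `Active(file)` and `Inactive(file)` fields are present; `memory_split` is a made-up helper name, and the split it reports is only an approximation of "heap/stack versus disk cache".

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo text into {field: value}, values as reported (mostly kB)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def memory_split(text):
    """Return anonymous vs file-backed memory as a percentage of MemTotal."""
    info = parse_meminfo(text)
    total = info["MemTotal"]
    anon = info.get("AnonPages", 0)
    file_cache = info.get("Active(file)", 0) + info.get("Inactive(file)", 0)
    return {
        "anon_pct": 100 * anon / total,
        "file_pct": 100 * file_cache / total,
    }


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():
        split = memory_split(meminfo.read_text())
        print(f"anonymous: {split['anon_pct']:.1f}%  file cache: {split['file_pct']:.1f}%")
```

A file-cache share that keeps shrinking toward a few percent while the anonymous share grows is the pattern described above.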