> That work did not address one other unfortunate characteristic of the OOM killer, though: its opinion of what is the least important process on the system tends to differ from that of the system's users.
My experience of the Linux OOM killer is not that its opinion differs from mine but that it has no opinion at all for a long, long time after the system is in deep trouble. The OOM killer simply does not act quickly enough to save systems. Sadly it's not customisable, but 'earlyoom' (packaged for Debian and probably everything else) is. I turned it on while debugging a badly behaved bit of software that went into a memory allocation loop, and have just left it on. It's saved me a few times and I now plan to leave it on forever.
Looks like oomd is an idea along the same lines but with slightly different goals. It's not in my distro so not an easy option for me.
This is my experience also. By the time the OOM killer kicks in, the system has already been locked up for 15 to 20 minutes. If it's production, you've already terminated the instance. If it's your laptop or desktop, you've already held down the power button.
Fedora has earlyoom enabled by default but so far it hasn't saved me. I really need to look into configuring it. How did you get started? Man pages? Blog post?
earlyoom is enabled on the Fedora 32 and 33 Workstation editions, and on the Fedora 33 KDE spin. On Fedora 34, all editions and spins have systemd-oomd enabled instead. It does take some initial configuration of systemd service units, since oomd works by cgroups-v2-based accounting and kills entire cgroups, not PIDs. This is still a work in progress, with uresourced setting up the initial resource allocations (with planned obsolescence). It should be safe to run uresourced on any edition or spin, but right now it's only enabled by default on the Workstation edition.
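For reference, the per-unit oomd behaviour is set with the `ManagedOOM*` resource-control settings. A minimal sketch of a drop-in (the file path and the 50% threshold here are illustrative, not Fedora's actual defaults):

```ini
# /etc/systemd/system/user@.service.d/10-oomd.conf (illustrative path)
[Service]
# Tell systemd-oomd to kill this unit's cgroup when its memory
# pressure stays above the limit, instead of waiting for the
# kernel OOM killer to act.
ManagedOOMMemoryPressure=kill
ManagedOOMMemoryPressureLimit=50%
```

Because oomd kills whole cgroups, it matters which slice/scope your processes land in; that's the part uresourced is meant to set up for you.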
If the results you're getting aren't what you expect, there's still some chance it's a bug somewhere, so you should report it against systemd and attach the output of `journalctl -b -o short-monotonic --no-hostname`, or at least ~10 minutes of logs prior to the unexpected behavior you're reporting.
The manpage combined with some (forced) experimentation with the wonky code mentioned was enough for me. I run with this config as even the earlyoom defaults were not strict enough:
-r 30 -m 5 -s 80
I run with 16gb of memory and do use swap. In practice if swap is growing at all once memory is near full, I'm in trouble and action needs to be taken.
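On Debian-style packaging those flags go in `/etc/default/earlyoom` (path and variable name as used by the Debian and Fedora packages, if memory serves); annotated version of the config above:

```shell
# /etc/default/earlyoom
# -r 30  print a memory report to the log every 30 seconds
# -m 5   act when available RAM drops below 5%
# -s 80  ...and free swap drops below 80%, i.e. act as soon as
#        swap starts filling while memory is already tight
EARLYOOM_ARGS="-r 30 -m 5 -s 80"
```

After editing, restart the service (`systemctl restart earlyoom`) for the new flags to take effect.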
While that is true, you can invoke the OOM killer at any time you want by pressing Ctrl+Alt+Print("SysRq")+F(uck?).
Given the letter they picked it seems like they're fully aware of the deficiency. As to why it was never really addressed, that's a good question. The recent le9[1] patch addresses the most annoying symptom of running out of memory by keeping some amount of RAM reserved for clean pages (like static program code) so that those won't get swapped in and out so much. That greatly improves responsiveness under low memory conditions.
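For the SysRq trigger mentioned above to work, the sysrq bitmask has to allow it; you can also fire it without the keyboard via /proc. A sketch (requires root, and the second command kills something immediately):

```shell
# Bit 64 enables the process-signalling functions, which include
# the 'f' (invoke OOM killer) trigger; put the setting in
# /etc/sysctl.d/ to make it persistent.
sysctl kernel.sysrq=64

# Invoke the OOM killer right now, no keyboard needed:
echo f > /proc/sysrq-trigger
```

This is handy on machines so wedged that only the magic SysRq path still responds.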
FWIW, I have a very distinct memory of the time OOM decided to kill init (pid 1), which was the most incredibly incompetent choice it possibly could have made ;P.
I used to experience this a long time ago, and I know many people who still do - but on my system (running an Ubuntu 18.04 derived distribution) the OOM killer takes at most 3-4 seconds to step in and kill whichever process is consuming the most memory. Does anyone know if Ubuntu/Debian tunes the OOM killer differently to try and stop this from happening?
The reason Linux systems lock up like that is because the kernel will let a process fill all memory up with dirty pages that need writeback, then as soon as it needs some memory the first thing it does is drop all of the in-memory copies of file-backed pages, which includes all of your programs. Then whenever one of your programs wants to run, or continue running by branching to a far address, it has to page that code back in from disk. Even though you thought your system does not "have swap", it does have swap in effect. The workaround for this is to copy important programs into memory and pin them there with mlock. It is particularly important that if you rely on a userspace OOM killer it gets locked into memory.
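One low-effort way to do this pinning is the third-party `vmtouch` utility, which mmaps files, touches every page, and mlock()s them so they can't be evicted. A sketch (assumes vmtouch is installed; the binary paths are just examples):

```shell
# -l  lock the files' pages into RAM with mlock()
# -d  daemonize and keep holding the lock
vmtouch -dl /usr/sbin/sshd /usr/bin/earlyoom
```

Note that mlock limits (`RLIMIT_MEMLOCK`) apply, so for large binaries you may need to run this as root or raise the limit.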
I've also found that under these conditions kswapd will effectively consume all your CPU time. The time it spends running is probably proportional to your maximum memory too - in our case it parses through 500+GB of LRU. The blocking writeback behaviour can be managed effectively with dirty page writeback ratio tuning. You don't want to block trying to write 50GB to disk at once when you hit the high dirty page watermark.
The amount of dirty pages the kernel keeps in RAM is configurable through sysctl. If the buffer is full any further write blocks the process. If you have less free RAM than the allowed dirty page buffer what you said is correct though.
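The knobs in question, as a sysctl fragment. The byte values below are illustrative, not recommendations; note the kernel treats the `*_bytes` and `*_ratio` variants as mutually exclusive pairs:

```shell
# /etc/sysctl.d/80-writeback.conf
# Cap dirty pages in absolute bytes rather than a % of RAM, so a
# large-memory box can't accumulate tens of GB of pending writeback.
vm.dirty_background_bytes = 268435456   # start background writeback at 256 MiB
vm.dirty_bytes = 1073741824             # block writers once 1 GiB is dirty
```

On a 500+GB machine the default percentage-based limits are exactly how you end up blocked trying to flush tens of gigabytes at once.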
There is a new patch[1] that lets you set a soft and a hard minimum of RAM reserved for clean pages. This fixes the problem almost completely, even under the heaviest loads.
cgroups, as a sibling post mentioned, can also help by setting soft limits for heavy background tasks like compile jobs. A soft limit gives them as much RAM as is available, or swaps them out completely when things are heavily contended, effectively pausing the processes. It requires some setting up, so it's not a solution for all cases, but it can make sense even without the thrashing problem.
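With systemd this can be as simple as a slice with a cgroup v2 soft limit; a sketch (the slice name and sizes are hypothetical):

```ini
# ~/.config/systemd/user/compile.slice (hypothetical unit name)
[Slice]
# Soft limit: above this the kernel throttles and reclaims from
# this slice rather than from the whole system; a hard OOM kill
# would only happen at MemoryMax, which we leave unset here.
MemoryHigh=8G
```

Then run the job inside it, e.g. `systemd-run --user --slice=compile.slice make -j"$(nproc)"`, and the thrashing stays confined to the compile job's cgroup.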
> The workaround for this is to copy important programs into memory and pin them there with mlock.
Nit: I don't think it's necessary to copy the program into anonymous memory to use mlock. You might be thinking of huge pages (transparent or otherwise), which are supported on anonymous pages and unfortunately aren't yet supported on ext4- or btrfs-backed file pages.
The resource control aspects of cgroups actually improve this situation, if you take advantage of them.
You'll still see that page fault thrashing but it becomes isolated to the cgroup experiencing the memory pressure. It doesn't bring down the entire system in my experience.
I'd never considered the program itself getting dropped and re-read from disk -- but it makes perfect sense. I'm curious if executing the programs from a ramdisk would achieve the same effect.
So, if a process is in an uninterruptible sleep, because say, it's doing i/o, say reading from disk, and furthermore, say it's using O_DIRECT, so the storage stack is going to set things up to bypass the buffer cache and DMA directly into the process's memory, and then you rip away the process's memory and give it to another process... and then the DMA completes, and kaboom? The DMA just clobbered an unsuspecting process's memory?
That is, it was my understanding that the reason a process is in an uninterruptible sleep is generally because it's waiting for a DMA to complete, and if you were permitted to interrupt it, the DMA would eventually complete and clobber who knows what. Ripping the memory away from such a process would, to my first glance, seem to encounter the same problem -- how do you stop the pending DMA (which might already be in progress, but which might also take awhile to complete.) Whatever method, it would be device dependent, which makes it impractical (who's going to retrofit all the device drivers with DMA stopping APIs, and I'm sure there are many devices that have no way to stop pending DMAs, and anyway, maybe the DMA is already in progress, only half completed.) Maybe do it at the pci level, unmap DMA buffers. But many current drivers will generally assume that they never give their devices bad bus addresses, yet now the device is attempting DMA to a bad bus address (i.e. a suddenly unmapped bus address).
Well, I've been out of the linux driver game for awhile now, so perhaps I'm missing or forgetting something. Ripping memory out from under a process with pending DMA sounds pretty sketchy to me though.
But, if dumb old me can think of this, of course the kernel developers can also, and undoubtedly did. Wonder how it really works?
This is right. Those physical pages get pinned by the device.
It's not like it's a new issue, anyway - physical backing pages can be dropped and re-used even if the process is still running, so the reference counting always had to exist.
> Maybe usermode processes shouldn't have uninterruptible I/O access.
This isn't something userspace processes opt into. It's just how blocking read and write syscalls on filesystems work [1]. If you ever hit a bad block, you may notice that threads that touch it through the fs just hang and you can't recover or kill them. This is how it's always been on Linux and other Unix systems. I of course absolutely hate this and am 100% behind you if you're suggesting changing it.
[1] With the exception of nfs if you set the "intr" mount option.
So how do high-reliability Linux systems get around this? Find bad blocks before a process does? Or use non-blocking I/O? (How does Windows/VMS handle it, is another interesting question)
Windows NT is async at its core, so it's handled more or less as Linux non-blocking I/O is, but it covers at least anything in the realm of the VFS, rather than being opt-in per FS implementation like Linux's non-blocking support.
Which? Pidfds were linked in the article to another great article. CloudABI was about making an ABI/API such that capsicum violations were build-time errors. Everything as a file descriptor is necessary for capsicum.
WASM's WASI is sort of the spiritual successor, but I prefer a diversity of tactics so CloudABI and WASI should both exist.
CloudABI is the best way to save desktop computing that's not a complete Hail Mary.
https://lore.kernel.org/lkml/f8457e20-c3cc-6e56-96a4-3090d7d... I tried to kick off more discussion of spawning processes with pidfds --- I think sticking with fork + (fancier) exec was the weakest part of CloudABI, but understandable at the time given one was trying to be implementable by all Unix-like kernels. The response from the main pidfds person is that a plan should be forthcoming this fall!
A possible solution, not to this, but for the OOM killer, would be an "importance" attribute orthogonal to the task priority - it's OK to kill the NTP server or to bounce the DNS proxy; it's much less OK to kill Emacs or my desktop.
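A rough version of this already exists as the per-process OOM score adjustment, which the kernel killer consults (and which tools like earlyoom generally respect too). A sketch, with the process names purely as examples:

```shell
# -1000 exempts a process from OOM killing entirely; positive
# values (up to +1000) make it a preferred victim.
echo -1000 | sudo tee /proc/"$(pidof emacs)"/oom_score_adj   # protect the editor
echo 500   | sudo tee /proc/"$(pidof ntpd)"/oom_score_adj    # sacrifice NTP first
```

The annoyance is that it's per-PID rather than declarative, though systemd's `OOMScoreAdjust=` lets you bake it into a service unit.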
I don't know... Every time I have an OOM it's been "my fault" - either because I wrote some code that allocated memory in a busy loop, or because I tried to compile something in a memory-starved VM.
I could totally see myself running out of memory trying to open /dev/sda in emacs or doing something stupid in elisp. So killing emacs makes a lot more sense than some innocent daemon.