At Google almost 20 years ago, a bunch of our machines (possibly with slightly bespoke CPUs?) were behaving oddly. These machines were mostly used for serving Google's web index, so almost the entire RAM was devoted to index data; the indexserver processes were designed to be robust against hardware failure, and if they noticed any kind of corruption they'd dump and reload their data. We noticed that they were dumping and reloading massively more often than we'd expect.
Eventually the cause was narrowed down: at random, when the machine was under stress, the second half (actually, the final 2052 bytes) of some physical page in memory would get zeroed out. This wasn't great for the indexservers, but they survived thanks to the defensive way they accessed their data. When we tried to use these new machines for Gmail, though, it was disastrous - random zeroing of general process code/data, or even kernel data, meant things crashed hard.
We noticed from the kernel panic dumps (Google had a feature that sent kernel panics over the network to a central collector, which got a lot of use around this time) that a small number of pages were showing up in crash dump registers far more often than would statistically be expected. This suggested that the zeroing wasn't completely random. So we added a list of "bad pages" that would be forcefully removed from the kernel's allocator at boot time, so those pages would never be allocated for the kernel or any process. Any time we saw more than a few instances of some page address in a kernel panic dump, we added it to the list for the next kernel build. Like magic, this dropped the rate of crashes down into the noise level.
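To make the mechanism concrete, the idea looks roughly like the sketch below, written against the modern Linux memblock API. This is illustrative only: the addresses and the function name are made up, and the real change was against the bootmem allocator of a much older kernel, with the list itself generated from the crash-dump statistics rather than typed in by hand.

    /*
     * Illustrative only: a hardcoded bad-page list reserved at boot so the
     * allocator never hands these physical frames to the kernel or to any
     * process.  The addresses are placeholders, and the memblock API is the
     * modern interface, not what the original kernel build used.
     */
    #include <linux/kernel.h>
    #include <linux/init.h>
    #include <linux/memblock.h>
    #include <linux/mm.h>

    static const phys_addr_t bad_pages[] __initconst = {
            0x0000000012345000ULL,          /* example addresses only */
            0x00000000deadb000ULL,
    };

    void __init reserve_bad_pages(void)
    {
            int i;

            /* Called early (e.g. from setup_arch()), before the buddy
             * allocator takes over, so these pages never enter circulation. */
            for (i = 0; i < ARRAY_SIZE(bad_pages); i++)
                    memblock_reserve(bad_pages[i], PAGE_SIZE);
    }

Because the reservation happens before normal memory management starts, neither the kernel nor any process can ever be handed those physical frames, which is exactly the property the hack needed.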
The root cause of the problem was never really determined (probably some kind of chipset bug), and those machines are long obsolete now. But it was somehow discovered that if you reset the machine by poking some register in the northbridge, rather than via the normal reset mechanism, the problem went away entirely. So for years the Google bootup scripts included a check for this kind of CPU/chipset, followed by a check of how the last reset had been performed (tracked via a marker file); if it wasn't the special hard reset, the script added the marker file and poked the northbridge to reset the machine again. These machines took far, far longer to reboot than any other machines in the fleet because of the extra checks and the double reboot, but it worked.
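For flavour, the double-reset dance might have looked something like the sketch below. Every specific in it is invented: the marker path, the omitted CPU/chipset check, and above all the magic northbridge register (the placeholders are not the real values); the 0xCF8/0xCFC port writes are just the generic way a root-privileged userspace tool of that era could reach a host-bridge config register.

    /*
     * Rough reconstruction of the double-reset logic, with every specific
     * invented: the marker path and, above all, the northbridge register
     * and value (the placeholders below are NOT the real ones).  Writing
     * PCI config space from userspace via ports 0xCF8/0xCFC needs root
     * and iopl(3).
     */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/io.h>

    #define MARKER "/var/hard_reset_done"   /* hypothetical marker file */

    static void poke_northbridge_reset(void)
    {
            /* Hypothetical: flip a reset bit in the host bridge's config
             * space (bus 0, device 0, function 0).  Register and value
             * are placeholders, not the real chipset magic. */
            unsigned int reg = 0x40, value = 0x1;

            outl(0x80000000u | (reg & 0xFCu), 0xCF8);  /* select register   */
            outl(value, 0xCFC);                        /* write: machine resets */
    }

    int main(void)
    {
            if (access(MARKER, F_OK) == 0) {
                    /* Previous reset was the special one: consume the
                     * marker and let boot continue normally. */
                    unlink(MARKER);
                    return 0;
            }

            /* Normal reset detected: leave a marker for the next boot,
             * then force the special reset through the northbridge. */
            FILE *f = fopen(MARKER, "w");
            if (f)
                    fclose(f);
            sync();

            if (iopl(3) != 0) {
                    perror("iopl");
                    return 1;
            }
            poke_northbridge_reset();
            return 0;
    }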