Race conditions are weird. I had a service that ran fine for years if not decade...

Race conditions are weird.

I had a service that ran fine for years if not decades on Java. One day, a minor update came in to the GNU core utils, which were not at all used by the service itself, and this somehow triggered the race every time in less than 5 minutes, taking down our production cluster. The same update didn't do anything to preproduction, even under much higher load than prod had.

There was a clear bug to fix and a clear root cause. Even so, I never understood what exactly pushed it over the edge.