I had a Java service that ran fine for years, if not decades. One day a minor update came in to GNU coreutils, which the service itself didn't use at all, and this somehow triggered the race every time in less than 5 minutes, taking down our production cluster. The same update did nothing to preproduction, even under much higher load than prod had.
There was a clear bug to fix and a clear root cause. Even so, I never understood what exactly pushed it over the edge.