Sure, human failure is probabilistic. But you can design around that by stacking reliability-enhancing approaches together.
Let’s say there’s a 10% chance of any given feature being broken. Write a test (which has its own 10% chance of being broken), and now the feature only ships broken if both the code and the test are broken, and broken in the same way. So we’re down to <1% chance of failure. In my experience most bugs that make it past testing do so because you forgot to write a test at all.
Then add a backup / redundancy system. That has a 10% chance of failure, but if you test it regularly then the backup / restore process only has a 1% chance of failure.
Now we have a system that’s pretty reliable in practice, made out of pieces that are each only 90% reliable. And no need for PhD-level formal methods.
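To make the arithmetic concrete, here’s a minimal back-of-the-envelope sketch of that stacking calculation. The 10% and 1% figures are the rough assumptions from above, and treating the failures as independent is itself an assumption:

    # Rough model: layers fail independently (an assumption, not a guarantee).
    p_code_broken = 0.10     # chance a given feature is broken
    p_test_broken = 0.10     # chance its test is also broken
    p_restore_fails = 0.01   # chance a regularly-tested backup/restore fails

    # A feature ships broken only if the code AND its test are broken
    # (and even then only if they're broken in the same way, so this
    # is an upper bound).
    p_ships_broken = p_code_broken * p_test_broken          # 0.01

    # An unrecoverable incident needs it to ship broken AND the restore to fail.
    p_unrecoverable = p_ships_broken * p_restore_fails      # 0.0001

    print(f"ships broken: {p_ships_broken:.2%}, unrecoverable: {p_unrecoverable:.4%}")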
Just do the obvious robustness steps: Write unit tests. Run them with every commit. Have a backup system. Test it. Have redundant servers. Do staged deployments. Monitor your servers and have an on-call roster. Then, when everything is working well, add a chaos monkey to increase the failure rate of all of these parts so your team & software get practice dealing with problems.
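The chaos-monkey part can start very small. Here's a minimal sketch of the fault-injection idea, assuming a Python service where call_downstream is a hypothetical stand-in for whatever dependency call you want your team to practice surviving:

    import random

    def chaotic(p_failure=0.05):
        """Wrap a function so it randomly fails with probability p_failure."""
        def decorator(fn):
            def wrapper(*args, **kwargs):
                if random.random() < p_failure:
                    raise RuntimeError(f"chaos: injected failure in {fn.__name__}")
                return fn(*args, **kwargs)
            return wrapper
        return decorator

    # Hypothetical dependency call; stands in for a real service client.
    @chaotic(p_failure=0.05)
    def call_downstream(payload):
        return {"ok": True, "echo": payload}

    # Callers now have to handle occasional failures, which exercises
    # retries, fallbacks, alerting and the on-call process.
    try:
        call_downstream({"user": 42})
    except RuntimeError as err:
        print(f"handled: {err}")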
The fact that this bug slipped past all of their reliability engineering - past code review and testing, into production, and in a way they can’t recover from - smells of sloppy work.
I wonder if, in retrospect, that would have been better. If they had rolled back to a snapshot 30 minutes after they realized they had a problem, everyone loses 30 minutes of updates (and maybe the transaction logs could be copied before the rollback and replayed afterwards to reduce that even further). Everyone experiences a little bit of pain instead of some customers being down for a week plus. Easy to speculate from the cheap seats, though.
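For what it’s worth, the recovery path being described would look roughly like this. A sketch only: copy_transaction_logs, restore_snapshot and replay_transactions are hypothetical stand-ins for whatever your database actually provides (snapshots, WAL archiving, point-in-time recovery):

    import datetime as dt

    # Hypothetical helpers; real databases expose their own equivalents.
    def copy_transaction_logs(since, until):
        print(f"copying txn logs from {since} to {until} to safe storage")
        return ["txn-log-segment-1", "txn-log-segment-2"]

    def restore_snapshot(taken_at):
        print(f"restoring snapshot taken at {taken_at}")

    def replay_transactions(log_segments, stop_before):
        print(f"replaying {len(log_segments)} segments, stopping before {stop_before}")

    def roll_back_and_replay(snapshot_time, bad_change_time):
        # 1. Preserve everything written after the snapshot, before touching anything.
        logs = copy_transaction_logs(since=snapshot_time, until=dt.datetime.now())
        # 2. Roll the whole system back to the known-good snapshot.
        restore_snapshot(taken_at=snapshot_time)
        # 3. Replay the saved transactions up to (but not including) the bad change,
        #    so "everyone loses 30 minutes" shrinks to losing only the bad window.
        replay_transactions(logs, stop_before=bad_change_time)

    roll_back_and_replay(
        snapshot_time=dt.datetime(2021, 1, 1, 12, 0),
        bad_change_time=dt.datetime(2021, 1, 1, 12, 30),
    )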