
Sure, human failure is probabilistic. But you can design around that by stacking reliability-enhancing approaches together.

Let’s say there’s a 10% chance of any given feature being broken. Write a test (which has another 10% chance of being broken), and now the feature only ships broken if the test and the code are both broken, and broken in the same way. So we’re down to a <1% chance of failure. In my experience, most bugs that make it past testing do so because you forgot a test.

Then add a backup / redundancy system. That has a 10% chance of failure, but if you test it regularly then the backup / restore process only has a 1% chance of failure.

Now we have a system that’s pretty reliable in practice, made out of pieces which are only 90% reliable. And no need for PhD level formal methods.
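
To make the arithmetic above concrete, here is a rough sketch. It assumes the layers fail independently, which is a simplification: if the same misunderstanding produces both the bug and the broken test, the real number is higher, and the "broken in the same way" condition pushes it lower. The function name and the 10% figures are just the illustrative values from this comment, not measurements.

```python
def combined_failure(*layer_failure_probs):
    """Probability that every layer fails at once, assuming independent failures."""
    p = 1.0
    for q in layer_failure_probs:
        if not 0.0 <= q <= 1.0:
            raise ValueError("each probability must be between 0 and 1")
        p *= q
    return p


# Feature (10%) guarded by a test that is itself 10% likely to be broken.
code_and_test = combined_failure(0.10, 0.10)

# Backup system (10%) guarded by a regular restore drill (assumed 10% miss rate).
backup_and_drill = combined_failure(0.10, 0.10)

print(f"code + test both broken:    {code_and_test:.2%}")
print(f"backup + drill both broken: {backup_and_drill:.2%}")
```

Each extra independent layer multiplies the failure probability down, which is why pieces that are individually only 90% reliable can add up to something much better.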

Just do the obvious robustness steps: Write unit tests. Run them with every commit. Have a backup system. Test it. Have redundant servers. Do staged deployments. Monitor your servers and have an on-call roster. Then when everything is working well, add a chaos monkey to increase the failure rate of all of these parts so your team & software get practice dealing with problems.

The fact that this bug slipped past all of their reliability engineering - past code review and testing, into production, and in a way they can’t recover from - smells of sloppy work.



They had a backup restore process.

The trouble was the restore would set back everyone’s data to that point in time, whereas only some customers’ data was impacted.


I wonder if in retrospect that would have been better. If they had rolled back to a snapshot 30 minutes after they realized they had a problem, everyone loses 30 minutes of updates (and maybe transaction logs can be copied before the rollback and then replayed to reduce that to even less). Everyone experiences a little bit of pain instead of some customers being down for a week plus. Easy to speculate about from the cheap seats though.
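
A toy sketch of the "roll back, then replay the logs" idea, purely for illustration. Everything here is hypothetical: the in-memory dict standing in for the database, the log entry shape, and the "bad_script" source label are assumptions, and nothing is known from this thread about how the real system stores snapshots or logs. Skipping the offending writes during replay is also my addition.

```python
import copy


def restore_with_replay(snapshot, log, skip=lambda entry: False):
    """Rebuild state from a snapshot, then reapply logged writes in order.

    Entries for which skip(entry) is true are dropped, e.g. the writes
    made by the faulty job. A value of None means "delete this key".
    """
    state = copy.deepcopy(snapshot)  # never mutate the snapshot itself
    for entry in log:
        if skip(entry):
            continue
        if entry["value"] is None:
            state.pop(entry["key"], None)
        else:
            state[entry["key"]] = entry["value"]
    return state


# Snapshot taken before the incident (hypothetical data).
snapshot = {"alice": 1, "bob": 2}

# Writes recorded after the snapshot; the log is copied before rolling back.
log = [
    {"source": "app", "key": "alice", "value": 5},
    {"source": "bad_script", "key": "bob", "value": None},  # the damaging delete
    {"source": "app", "key": "carol", "value": 7},
]

restored = restore_with_replay(
    snapshot, log, skip=lambda e: e["source"] == "bad_script"
)
print(restored)
```

In this toy case the legitimate updates survive and the bad delete is undone. In a real system the hard part is telling the bad writes apart from the good ones that depended on them.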



