Sure, human failure is probabilistic. But you can design around that by stacking reliability-enhancing approaches together.
Let’s say there’s a 10% chance of any given feature being broken. Write a test (which has its own 10% chance of being broken), and now the feature only ships broken if both the code and the test are broken, and broken in the same way. So we’re down to <1% chance of failure. In my experience most bugs that make it past testing do so because you forgot to write a test at all.
Then add a backup / redundancy system. That has a 10% chance of failure, but if you test it regularly then the backup / restore process only has a 1% chance of failure.
Now we have a system that’s pretty reliable in practice, made out of pieces that are each only 90% reliable. And no need for PhD-level formal methods.
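To make the arithmetic concrete, here’s a minimal back-of-the-envelope sketch of that stacking calculation. The 10% and 1% figures are the rough assumptions from above, and treating the failures as independent is itself an assumption:

    # Rough model: layers fail independently (an assumption, not a guarantee).
    p_code_broken = 0.10     # chance a given feature is broken
    p_test_broken = 0.10     # chance its test is also broken
    p_restore_fails = 0.01   # chance a regularly-tested backup/restore fails

    # A feature ships broken only if the code AND its test are broken
    # (and even then only if they're broken in the same way, so this
    # is an upper bound).
    p_ships_broken = p_code_broken * p_test_broken          # 0.01

    # An unrecoverable incident needs it to ship broken AND the restore to fail.
    p_unrecoverable = p_ships_broken * p_restore_fails      # 0.0001

    print(f"ships broken: {p_ships_broken:.2%}, unrecoverable: {p_unrecoverable:.4%}")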
Just do the obvious robustness steps: Write unit tests. Run them with every commit. Have a backup system. Test it. Have redundant servers. Do staged deployments. Monitor your servers and have an on-call roster. Then, when everything is working well, add a chaos monkey to increase the failure rate of all of these parts so your team & software get practice dealing with problems.
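The chaos-monkey part can start very small. Here's a minimal sketch of the fault-injection idea, assuming a Python service where call_downstream is a hypothetical stand-in for whatever dependency call you want your team to practice surviving:

    import random

    def chaotic(p_failure=0.05):
        """Wrap a function so it randomly fails with probability p_failure."""
        def decorator(fn):
            def wrapper(*args, **kwargs):
                if random.random() < p_failure:
                    raise RuntimeError(f"chaos: injected failure in {fn.__name__}")
                return fn(*args, **kwargs)
            return wrapper
        return decorator

    # Hypothetical dependency call; stands in for a real service client.
    @chaotic(p_failure=0.05)
    def call_downstream(payload):
        return {"ok": True, "echo": payload}

    # Callers now have to handle occasional failures, which exercises
    # retries, fallbacks, alerting and the on-call process.
    try:
        call_downstream({"user": 42})
    except RuntimeError as err:
        print(f"handled: {err}")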
The fact that this bug slipped past all of their reliability engineering - past code review and testing, into production, and in a way they can’t recover from - smells of sloppy work.
I wonder if, in retrospect, that would have been better. If they had rolled back to a snapshot 30 minutes after they realized they had a problem, everyone loses 30 minutes of updates (and maybe the transaction logs could be copied before the rollback and replayed afterwards to reduce that even further). Everyone experiences a little bit of pain instead of some customers being down for a week plus. Easy to speculate from the cheap seats, though.
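For what it’s worth, the recovery path being described would look roughly like this. A sketch only: copy_transaction_logs, restore_snapshot and replay_transactions are hypothetical stand-ins for whatever your database actually provides (snapshots, WAL archiving, point-in-time recovery):

    import datetime as dt

    # Hypothetical helpers; real databases expose their own equivalents.
    def copy_transaction_logs(since, until):
        print(f"copying txn logs from {since} to {until} to safe storage")
        return ["txn-log-segment-1", "txn-log-segment-2"]

    def restore_snapshot(taken_at):
        print(f"restoring snapshot taken at {taken_at}")

    def replay_transactions(log_segments, stop_before):
        print(f"replaying {len(log_segments)} segments, stopping before {stop_before}")

    def roll_back_and_replay(snapshot_time, bad_change_time):
        # 1. Preserve everything written after the snapshot, before touching anything.
        logs = copy_transaction_logs(since=snapshot_time, until=dt.datetime.now())
        # 2. Roll the whole system back to the known-good snapshot.
        restore_snapshot(taken_at=snapshot_time)
        # 3. Replay the saved transactions up to (but not including) the bad change,
        #    so "everyone loses 30 minutes" shrinks to losing only the bad window.
        replay_transactions(logs, stop_before=bad_change_time)

    roll_back_and_replay(
        snapshot_time=dt.datetime(2021, 1, 1, 12, 0),
        bad_change_time=dt.datetime(2021, 1, 1, 12, 30),
    )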