Turning it off and then back on again probably fixes the issue. It’s very unlikely that there’s a grand ticking time bomb just waiting to bring it all down. Recycling servers will probably keep it running.
> Turning it off and then back on again probably fixes the issue.
Turning a large-scale system entirely off and on is never simple. Invariably you’ll run into some kind of circular dependency that must be manually investigated, and even tracking those down becomes tricky.
Classic examples are things like DNS, service locators, or authentication systems. And large tech companies are notorious for NIH syndrome when it comes to all of those.
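To make the circular dependency point concrete, here is a rough sketch of how you might look for a cold-start cycle if you had the dependency graph written down. The service names and the `deps` map are made up for illustration; in practice, building an accurate map is the hard part.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a path of service names, or None.

    deps maps each service to the services it needs before it can start.
    """
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state: Dict[str, int] = {service: UNSEEN for service in deps}
    path: List[str] = []  # services on the current depth-first path

    def visit(service: str) -> Optional[List[str]]:
        state[service] = IN_PROGRESS
        path.append(service)
        for dep in deps.get(service, []):
            dep_state = state.get(dep, UNSEEN)
            if dep_state == IN_PROGRESS:
                # dep is already on the current path, so we looped back to it.
                return path[path.index(dep):] + [dep]
            if dep_state == UNSEEN:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[service] = DONE
        return None

    for service in deps:
        if state[service] == UNSEEN:
            cycle = visit(service)
            if cycle:
                return cycle
    return None


# Hypothetical example: each system needs another one up before it can boot.
example = {
    "dns": ["auth"],
    "auth": ["service_locator"],
    "service_locator": ["dns"],
}
print(find_cycle(example))
```

A cycle like this is invisible while everything is already running, and only bites when all three have to start from nothing.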
There’s so much redundancy built into modern distributed systems that you can reliably bounce a VM without issue. You can just as reliably do a rolling bounce across a series of VMs.
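For anyone unfamiliar with the term, a rolling bounce looks roughly like the sketch below: restart a small batch, confirm it came back healthy, then move on, so most of the fleet keeps serving the whole time. The `restart` and `is_healthy` callbacks are placeholders for whatever your orchestration tooling provides, not a real API.

```python
from typing import Callable, List


def rolling_bounce(
    vms: List[str],
    restart: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    batch_size: int = 1,
) -> List[str]:
    """Restart VMs one batch at a time and return the VMs bounced.

    Raises RuntimeError as soon as a batch fails its health check, so the
    untouched VMs keep serving traffic.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    bounced: List[str] = []
    for start in range(0, len(vms), batch_size):
        batch = vms[start:start + batch_size]
        for vm in batch:
            restart(vm)
        unhealthy = [vm for vm in batch if not is_healthy(vm)]
        if unhealthy:
            # Halt here rather than taking more capacity out of rotation.
            raise RuntimeError(f"unhealthy after restart: {unhealthy}")
        bounced.extend(batch)
    return bounced


# Hypothetical usage with stand-in callbacks.
restarted: List[str] = []
print(rolling_bounce(["vm-1", "vm-2", "vm-3"], restarted.append, lambda vm: True, batch_size=2))
```

The batch size is the knob: it caps how much capacity is out of rotation at any moment, which is why the redundancy matters.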
Twitter doesn’t have unique scale problems by today’s standards.
Things do still need to be fixed, of course.