> how does it persuade you of that? By flowing from *many people think it's bloa...

btown · on Nov 22, 2022

https://twitter.com/atax1a/status/1594880931042824192 provides a bit more context and nuance (albeit with a sardonic tone, though it's hard to blame someone who saw their entire professional community fired for being sardonic).

There's no doubt that OP built a great and stable automation layer on top of Mesos for caching workloads. But there are numerous other types of workloads on top of Mesos (including, I presume, mission-critical database deployments that need well-disciplined draining protocols to shift between nodes), as well as administrative needs at the Mesos-to-infrastructure level, and things running on bare metal below the Mesos level. These things all needed dedicated SREs, and the absence of these SREs could result in a scenario like the one mentioned in the Twitter thread I linked - two obscure mutually-dependent components expire and cannot be re-provisioned using documented tools.

I also think an important meta-point is that when Twitter was bringing in substantial revenue from advertising, every minute of downtime had significant costs - costs that could easily make it worthwhile to "over-provision" SRE talent. With advertisers pausing engagement, perhaps Twitter now loses less money from a day-long outage than it would spend on having the right talent to turn a day-long outage into a minutes-long outage.

Twitter is now judged only by its profitability (namely, Musk's ability to service debt without selling more Tesla stock than he already has), while most other tech companies (both public and private) are judged by both profitability and revenue growth. If you want both, larger SRE teams, to say nothing of feature development and regulatory compliance teams, start to make a lot more sense.