> one really gets the sense that it took them 75 minutes to go from "things are breaking" to "we've narrowed it down to a single service endpoint, but are still researching," which is something of a bitter pill to swallow
Is 75 minutes really considered that long of a time? I don't do my day-job in webdev, so maybe I'm just naive. But being able to diagnose the single service endpoint in 75 minutes seems pretty good to me. When I worked on firmware we frequently spent _weeks_ trying to diagnose what part of the firmware was broken.
> Is 75 minutes really considered that long of a time? [...] When I worked on firmware we frequently spent _weeks_ trying to diagnose what part of the firmware was broken.
One might spend weeks diagnosing a problem if the problem only happens 0.01% of the time, correlated with nothing, goes away when retried, and nobody can reproduce it in a test environment.
But 0.01%-and-it-goes-away-when-retried does not make a high priority incident. High priority incidents tend to be repeatable problems that weren't there an hour ago.
Generally a well-designed, properly resourced, business-critical system will be simple enough and well enough monitored that problems can be diagnosed in a good deal less than 75 minutes - even if rolling out a full fix takes longer.
Of course, I don't know how common well-designed, properly resourced, business-critical systems are.
A few years back I was working at a software company that provided on-site sensor networks to hospitals, pharmacies, etc. Our product required them to physically install a server on-site, but we were starting to get disrupted by cloud-based solutions. Essentially what we did was alert medical staff when blood, organ, etc. refrigeration temperatures went out of range. If the right people did not get notifications on time for these issues, people would die. It's not hyperbole: you have to wait years for a liver transplant. There aren't just new livers available for everyone if a handful of them spoil.
With that being said, the problem here isn't that it took 75 minutes to find the root cause, but rather that the fix took hours to propagate through the us-east-1 data center network, which is completely unacceptable for industries like healthcare where even small disruptions are a matter of life and death.
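The core of the product was conceptually just threshold alerting on sensor readings plus a notification hook; a minimal sketch of that idea (the names, the range, and the notify callback are made up for illustration, not the actual product's design):

```python
from dataclasses import dataclass

@dataclass
class SensorReading:
    sensor_id: str
    temp_c: float

# Illustrative safe range only, not a clinical spec.
SAFE_RANGE_C = (2.0, 8.0)

def check_reading(reading: SensorReading, notify) -> bool:
    """Return True if the reading is in range; otherwise fire the notify hook."""
    low, high = SAFE_RANGE_C
    if low <= reading.temp_c <= high:
        return True
    notify(f"Sensor {reading.sensor_id} out of range: {reading.temp_c:.1f} C")
    return False

# Example: check_reading(SensorReading("fridge-3", 11.4), notify=print)
```

The hard part was never the check itself - it was making sure the notification reliably reached the right on-call person, which is exactly the path that breaks when the infrastructure underneath you is having an hours-long bad day.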
>Is 75 minutes really considered that long of a time?
From my experience in setting up and running support services, not really. It's actually pretty darn quick.
First, the issue is reported to level 1 support, which is a bunch of juniors/drones on call, often offshore (depending on the time of day), who'll run through their scripts and, having determined that it's not in there, escalate to level 2.
Level 2 would be a more experienced developer/support tech, who's seen a thing or two and dealt with serious issues. It will take time to get them online, as they're on call but not online at 3am EST - they have to get their cup of joe, turn on the laptop, etc. It would take them a bit to realize that the fecal matter made contact with the rotating blades and escalate to level 3.
Which involves setting up the bridge, waking up the decision makers (in my case it was director and VP level), and finally waking up the guy who either a) wrote all this or b) is one of 5 or 6 people on the planet capable of understanding and troubleshooting the tangled mess.
I do realize that AWS support might be structured quite a bit differently, but still... 75 minutes is pretty good.
Edit: That is not to say that AWS doesn't have a problem with turnover. I'm well aware of their policies and tendency to get rid of people in 2/3 years, partially due to compensation structures where there's a significant bump in compensation - and vesting - once you reach that timeframe.
But in this particular case I don't think support should take much of a blame. The overall architecture on the other hand...
Sorry, are you saying you worked at Amazon and this is how they handle major outages? Just snooze and wait for a ticket to make its way up from end user support? No monitoring? No global time zone coverage?
Because if so, this seems like about the most damning thing I could learn from this incident.
No, it's just mindless speculation from someone who clearly hasn't worked a critical service's on-call rotation before. It's not at all what it's actually like: all these services have automatic alarms that will start blaring and firing pagers, and once the scope of impact is determined to be large, escalations start happening extremely quickly, paging anyone even possibly able to diagnose the issue. There are also crisis rotations staffed with high-level ICs and incident managers who will join ASAP and start directing the situation; you don't need to wait for some director or VP.
I worked at AWS (EC2 specifically), and the comment is accurate.
Engineers own their alarms, which they set up themselves during working hours. An engineer on call carries a "pager" for a given system they own as part of a small team. If your own alert rules get tripped, you will be automatically paged regardless of time of day. There are a variety of mechanisms to prioritize and delay issues until business hours, and suppress alarms based on various conditions - e.g. the health of your own dependencies.
End-user tickets cannot page engineers, but fellow internal teams can. Generally, escalating and paging additional help when one cannot handle the situation is encouraged, and many tenured/senior engineers are very keen to help, even at weird hours.
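For anyone who hasn't seen this model: the shape of an engineer-owned alarm is roughly what you'd get from the public CloudWatch API, i.e. a metric alarm wired to a paging action. A sketch - the metric, namespace, threshold, and SNS topic here are placeholders, and AWS's internal tooling is presumably its own thing:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical alarm owned by the team: page the on-call if p99 latency
# stays above 500 ms for 3 consecutive 1-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="orders-api-p99-latency-high",
    Namespace="MyTeam/OrdersApi",    # placeholder namespace
    MetricName="RequestLatency",     # placeholder metric
    ExtendedStatistic="p99",
    Period=60,
    EvaluationPeriods=3,
    Threshold=500.0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",    # missing data is itself treated as a problem
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:page-oncall"],  # placeholder paging topic
)
```

The delay-until-business-hours and dependency-based suppression mentioned above map loosely onto things like composite alarms in the public API, though again the internal mechanisms are presumably different.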
Wholly inaccurate. AWS Systems Engineers would have been paged by automated monitoring systems once alert thresholds were breached. No escalation through Support needed.
Quite a few of AWS's more mature customers (including my company) were aware within 15 minutes of the incident that Dynamo was failing and hypothesized that it had taken out other services. Hopefully AWS engineers were at least as fast.
75 minutes to make a decision about how to message that outage is not particularly slow though, and my guess is that this is where most of the latency actually came from.
The web operates in a very different world if you've invested in good tooling. I used to be lead on a modestly sized payment processing back end to the tune of about 100 transactions/second (we were essentially Stripe for the client facing apps at the company). In many cases our monitoring and telemetry let us identify root cause in a matter of minutes. Not saying that is or should be the norm for all web apps, but what we had was not too far off from a read-only debugger view of the back end app's state throughout the request and it was very powerful. Of course for us more often than not the root cause was "the bank we depend on is having a problem" so our knowledge couldn't do much other than help the company shape customer communications about the incident.
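To make "read-only debugger view" a bit more concrete: the rough idea (hypothetical names below, not our actual system) was request-scoped traces where every named step records its timing plus a snapshot of relevant state, emitted as structured logs you can query during an incident:

```python
import json
import time
import uuid
from contextlib import contextmanager

class RequestTrace:
    """Collects per-step timing and state snapshots for one request."""

    def __init__(self, request_id=None):
        self.request_id = request_id or str(uuid.uuid4())
        self.steps = []

    @contextmanager
    def step(self, name, **state):
        start = time.monotonic()
        try:
            yield
        finally:
            self.steps.append({
                "step": name,
                "ms": round((time.monotonic() - start) * 1000, 2),
                "state": state,  # whatever the caller chose to snapshot
            })

    def emit(self):
        # In real life this goes to the telemetry pipeline; stdout stands in here.
        print(json.dumps({"request_id": self.request_id, "steps": self.steps}))

# Hypothetical usage in a payment flow (bank_client and charge are placeholders):
# trace = RequestTrace()
# with trace.step("authorize", amount_cents=1299, bank="acme"):
#     bank_client.authorize(charge)
# trace.emit()
```

With something like that in place, "which step failed and what did the request see at that point" becomes a log query rather than a debugging session - which is how you get to root cause in minutes instead of hours.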
Also, it's pretty likely it took less time than that to get an idea, but generally for public updates you want to be very reserved; otherwise users get the wrong impression.
For a service like AWS, 75 minutes is going to result in a LOT of COEs for people on why it wasn't mitigated quicker. A Sev 1 like this has an SLA of 20 mins to mitigate impact. Writing about these failures will consume a dozen people's time for the next 6 weeks.
I have 10 years of experience at Amazon as an L6/L7 SDM, across 4 teams (Games, logistics, Alexa, Prime video). I have also been on a team that caused a sev 1 in the past.
Amazon is supposed to have the best infrastructure in the business because everyone else runs on it. They should have access to the SRE talent that can quickly mitigate this kind of issue.
It's 75 minutes to _communicate_ the message to customers. Internal teams were definitely ahead of this before it was posted to the AWS Health Dashboard. Status page posts are lagging indicators of incident progress.
I work in an incident management team where the turnaround from "we've decided to take x action, to y metric shows it is working, to z is posted on the status page" can be 1-2 minutes.
It is possible with professionals, institutional knowledge, drills, and good tools.