> one really gets the sense that it took them 75 minutes to go from "things are breaking" to "we've narrowed it down to a single service endpoint, but are still researching," which is something of a bitter pill to swallow
Is 75 minutes really considered that long of a time? I don't do my day-job in webdev, so maybe I'm just naive. But being able to diagnose the single service endpoint in 75 minutes seems pretty good to me. When I worked on firmware we frequently spent _weeks_ trying to diagnose what part of the firmware was broken.
> Is 75 minutes really considered that long of a time? [...] When I worked on firmware we frequently spent _weeks_ trying to diagnose what part of the firmware was broken.
One might spend weeks diagnosing a problem if the problem only happens 0.01% of the time, correlated with nothing, goes away when retried, and nobody can reproduce it in a test environment.
But 0.01%-and-it-goes-away-when-retried does not make a high priority incident. High priority incidents tend to be repeatable problems that weren't there an hour ago.
Generally a well-designed, properly resourced, business-critical system will be simple enough and well enough monitored that problems can be diagnosed in a good deal less than 75 minutes - even if rolling out a full fix takes longer.
Of course, I don't know how common well-designed, properly resourced, business-critical systems are.
A few years back I was working at a software company that provided on-site sensor networks to hospitals, pharmacies, etc. Our product required them to physically install a server on-site, but we were starting to get disrupted by cloud-based solutions. Essentially what we did was alert medical staff when blood, organ, etc. refrigeration temperatures went out of range. If the right people did not get notifications on time for these issues, people would die. It's not hyperbole: you have to wait years for a liver transplant. There aren't just new livers available for everyone if a handful of them spoil.
With that being said, the problem here isn't that it took 75 minutes to find the root cause, but rather that the fix took hours to propagate through the us-east-1 data center network, which is completely unacceptable for industries like healthcare where even small disruptions are a matter of life and death.
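The core of the product was conceptually just threshold alerting on sensor readings plus a notification hook; a minimal sketch of that idea (the names, the range, and the notify callback are made up for illustration, not the actual product's design):

```python
from dataclasses import dataclass

@dataclass
class SensorReading:
    sensor_id: str
    temp_c: float

# Illustrative safe range only, not a clinical spec.
SAFE_RANGE_C = (2.0, 8.0)

def check_reading(reading: SensorReading, notify) -> bool:
    """Return True if the reading is in range; otherwise fire the notify hook."""
    low, high = SAFE_RANGE_C
    if low <= reading.temp_c <= high:
        return True
    notify(f"Sensor {reading.sensor_id} out of range: {reading.temp_c:.1f} C")
    return False

# Example: check_reading(SensorReading("fridge-3", 11.4), notify=print)
```

The hard part was never the check itself - it was making sure the notification reliably reached the right on-call person, which is exactly the path that breaks when the infrastructure underneath you is having an hours-long bad day.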
>Is 75 minutes really considered that long of a time?
From my experience in setting up and running support services, not really. It's actually pretty darn quick.
First, the issue is reported to level 1 support, which is a bunch of juniors/drones on call, often offshore (depending on the time of day), who'll run through their scripts and, having determined that it's not in there, escalate to level 2.
Level 2 would be a more experienced developer/support tech, who's seen a thing or two and dealt with serious issues. It will take time to get them online, as they're on call but not online at 3am EST - they have to get their cup of joe, turn on the laptop, etc. It would take them a bit to realize that the fecal matter made contact with the rotating blades and escalate to level 3.
Which involves setting up the bridge, waking up the decision makers (in my case it was director and VP level), and finally waking up the guy who either a) wrote all this or b) is one of 5 or 6 people on the planet capable of understanding and troubleshooting the tangled mess.
I do realize that AWS support might be structured quite a bit differently, but still... 75 minutes is pretty good.
Edit: That is not to say that AWS doesn't have a problem with turnover. I'm well aware of their policies and tendency to get rid of people in 2/3 years, partially due to compensation structures where there's a significant bump in compensation - and vesting - once you reach that timeframe.
But in this particular case I don't think support should take much of a blame. The overall architecture on the other hand...
Sorry, are you saying you worked at Amazon and this is how they handle major outages? Just snooze and wait for a ticket to make its way up from end user support? No monitoring? No global time zone coverage?
Because if so, this seems like about the most damning thing I could learn from this incident.
No, it's just mindless speculation from someone who clearly hasn't worked a critical service's on-call rotation before. It's not at all what it's actually like: all these services have automatic alarms that will start blaring and firing pagers, and once the scope of impact is determined to be large, escalations start happening extremely quickly, paging anyone even possibly able to diagnose the issue. There are also crisis rotations staffed with high-level ICs and incident managers who will join ASAP and start directing the situation; you don't need to wait for some director or VP.
I worked at AWS (EC2 specifically), and the comment is accurate.
Engineers own their alarms, which they set up themselves during working hours. An engineer on call carries a "pager" for a given system they own as part of a small team. If your own alert rules get tripped, you will be automatically paged regardless of time of day. There are a variety of mechanisms to prioritize and delay issues until business hours, and suppress alarms based on various conditions - e.g. the health of your own dependencies.
End-user tickets cannot page engineers, but fellow internal teams can. Generally, escalating and paging additional help when one cannot handle the situation is encouraged, and many tenured/senior engineers are very keen to help, even at weird hours.
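For anyone who hasn't seen this model: the shape of an engineer-owned alarm is roughly what you'd get from the public CloudWatch API, i.e. a metric alarm wired to a paging action. A sketch - the metric, namespace, threshold, and SNS topic here are placeholders, and AWS's internal tooling is presumably its own thing:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical alarm owned by the team: page the on-call if p99 latency
# stays above 500 ms for 3 consecutive 1-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="orders-api-p99-latency-high",
    Namespace="MyTeam/OrdersApi",    # placeholder namespace
    MetricName="RequestLatency",     # placeholder metric
    ExtendedStatistic="p99",
    Period=60,
    EvaluationPeriods=3,
    Threshold=500.0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",    # missing data is itself treated as a problem
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:page-oncall"],  # placeholder paging topic
)
```

The delay-until-business-hours and dependency-based suppression mentioned above map loosely onto things like composite alarms in the public API, though again the internal mechanisms are presumably different.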
Wholly inaccurate. AWS Systems Engineers would have been paged by automated monitoring systems once alert thresholds were breached. No escalation through Support needed.
Quite a few of AWS's more mature customers (including my company) were aware within 15 minutes of the incident that Dynamo was failing and hypothesized that it had taken out other services. Hopefully AWS engineers were at least as fast.
75 minutes to make a decision about how to message that outage is not particularly slow though, and my guess is that this is where most of the latency actually came from.
The web operates in a very different world if you've invested in good tooling. I used to be lead on a modestly sized payment processing back end to the tune of about 100 transactions/second (we were essentially Stripe for the client facing apps at the company). In many cases our monitoring and telemetry let us identify root cause in a matter of minutes. Not saying that is or should be the norm for all web apps, but what we had was not too far off from a read-only debugger view of the back end app's state throughout the request and it was very powerful. Of course for us more often than not the root cause was "the bank we depend on is having a problem" so our knowledge couldn't do much other than help the company shape customer communications about the incident.
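To make "read-only debugger view" a bit more concrete: the rough idea (hypothetical names below, not our actual system) was request-scoped traces where every named step records its timing plus a snapshot of relevant state, emitted as structured logs you can query during an incident:

```python
import json
import time
import uuid
from contextlib import contextmanager

class RequestTrace:
    """Collects per-step timing and state snapshots for one request."""

    def __init__(self, request_id=None):
        self.request_id = request_id or str(uuid.uuid4())
        self.steps = []

    @contextmanager
    def step(self, name, **state):
        start = time.monotonic()
        try:
            yield
        finally:
            self.steps.append({
                "step": name,
                "ms": round((time.monotonic() - start) * 1000, 2),
                "state": state,  # whatever the caller chose to snapshot
            })

    def emit(self):
        # In real life this goes to the telemetry pipeline; stdout stands in here.
        print(json.dumps({"request_id": self.request_id, "steps": self.steps}))

# Hypothetical usage in a payment flow (bank_client and charge are placeholders):
# trace = RequestTrace()
# with trace.step("authorize", amount_cents=1299, bank="acme"):
#     bank_client.authorize(charge)
# trace.emit()
```

With something like that in place, "which step failed and what did the request see at that point" becomes a log query rather than a debugging session - which is how you get to root cause in minutes instead of hours.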
Also, it's pretty likely it took less time than that to get an idea, but generally for public updates you want to be very reserved; otherwise users get the wrong impression.
For a service like AWS, 75 minutes is going to result in a LOT of COEs for people on why it wasn't mitigated quicker. A Sev 1 like this has an SLA of 20 mins to mitigate impact. Writing about these failures will consume a dozen people's time for the next 6 weeks.
I have 10 years of experience at Amazon as an L6/L7 SDM, across 4 teams (Games, logistics, Alexa, Prime video). I have also been on a team that caused a sev 1 in the past.
Amazon is supposed to have the best infrastructure in the business because everyone else runs on it. They should have access to the SRE talent that can quickly mitigate this kind of issue.
It's 75 minutes to _communicate_ the message to customers. Internal teams were definitely ahead of this before it was posted to the AWS Health Dashboard. Status page posts are lagging indicators of incident progress.
I work in an incident management team where the turnaround from "we've decided to take x action, to y metric shows it is working, to z is posted on the status page" can be 1-2 minutes.
It is possible with professionals, institutional knowledge, drills, and good tools.