[09:13 AM PDT] We have taken additional mitigation steps to aid the recovery of the underlying internal subsystem responsible for monitoring the health of our network load balancers and are now seeing connectivity and API recovery for AWS services. We have also identified and are applying next steps to mitigate throttling of new EC2 instance launches. We will provide an update by 10:00 AM PDT.
[08:43 AM PDT] We have narrowed down the source of the network connectivity issues that impacted AWS Services...
[08:04 AM PDT] We continue to investigate the root cause for the network connectivity issues...
[12:11 AM PDT] <declared outage>
They claim not to have known the root cause for ~8hr
Sure, that timeline looks bad when you leave out the 14 updates between 12:11am PDT and 8:04am PDT.
The initial cause appears to be a bad DNS entry that they rolled back at 2:22am PDT. They started seeing recovery across services, but as reports of EC2 failures kept rolling in, they found a load balancer network issue that was causing the failures, announced at 8:43am.
I didn't say they fixed everything within those 14 updates. I'm pointing out it's disingenuous to say they didn't start working on the issue until the start of business when there are 14 updates describing what they had found and done during that time.
I don’t think that’s true; there was an initial Dynamo outage, resolved in the wee hours, that ultimately cascaded into the EC2 problem that lasted most of the day.
Was the Dynamo outage separate? My take was that the NLB issue was the root cause and Dynamo was a symptom; they flipped some internal switches to mitigate the impact on that dependency.
If their internal NLB monitoring can delete the A record for dynamodb, that seems like a weird dependency (I can imagine the NLB going missing entirely causing some orchestration to clean up the record, but this didn't sound like that).
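To make the kind of orchestration being imagined here concrete, a minimal sketch of a health-check loop that withdraws a DNS record when its load balancer target stops passing checks. This is purely hypothetical and not AWS's actual tooling; the record name, threshold, DNS table, and check_nlb_health function are all invented for illustration.

    # Hypothetical sketch: a monitor that withdraws a DNS record after repeated
    # failed health checks against its NLB target. All names here are invented.
    import time

    RECORD = "dynamodb.us-east-1.example"      # placeholder record name
    FAILURES_BEFORE_WITHDRAWAL = 3

    dns_table = {RECORD: "203.0.113.10"}       # stand-in for an authoritative zone

    def check_nlb_health() -> bool:
        """Placeholder for a real probe against the NLB target."""
        return True

    consecutive_failures = 0
    while True:
        if check_nlb_health():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            # The risky part: if the monitor itself is broken or flapping,
            # it can withdraw a record for a service that is actually fine.
            if consecutive_failures >= FAILURES_BEFORE_WITHDRAWAL:
                dns_table.pop(RECORD, None)    # record vanishes for everything downstream
        time.sleep(30)

The point of the sketch is the failure mode: once record cleanup is automated off a health signal, a confused monitor can take the record out even though the service behind it is up.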