[09:13 AM PDT] We have taken additional mitigation steps to aid the recovery of the underlying internal subsystem responsible for monitoring the health of our network load balancers and are now seeing connectivity and API recovery for AWS services. We have also identified and are applying next steps to mitigate throttling of new EC2 instance launches. We will provide an update by 10:00 AM PDT.
[08:43 AM PDT] We have narrowed down the source of the network connectivity issues that impacted AWS Services...
[08:04 AM PDT] We continue to investigate the root cause for the network connectivity issues...
[12:11 AM PDT] <declared outage>
They claim not to have known the root cause for ~8hr
Sure, that timeline looks bad when you leave out the 14 updates between 12:11am PDT and 8:04am PDT.
The initial cause appears to be a bad DNS entry that they rolled back at 2:22am PDT. They started seeing recovery across services, but as reports of EC2 failures kept rolling in, they found a load balancer network issue that was causing the failures, announced at 8:43am.
I didn't say they fixed everything within those 14 updates. I'm pointing out it's disingenuous to say they didn't start working on the issue until the start of business when there are 14 updates describing what they had found and done during that time.
I don’t think that’s true; there was an initial Dynamo outage, resolved in the wee hours, that ultimately cascaded into the EC2 problem that lasted most of the day.
Was the Dynamo outage separate? My take was that the NLB issue was the root cause and Dynamo was a symptom; they flipped some internal switches to mitigate the impact on that dependency.
If their internal NLB monitoring can delete the A record for dynamodb, that seems like a weird dependency (I can imagine the NLB going missing entirely causing some orchestration to clean up the record, but this didn't sound like that).
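To make the kind of orchestration being imagined here concrete, a minimal sketch of a health-check loop that withdraws a DNS record when its load balancer target stops passing checks. This is purely hypothetical and not AWS's actual tooling; the record name, threshold, DNS table, and check_nlb_health function are all invented for illustration.

    # Hypothetical sketch: a monitor that withdraws a DNS record after repeated
    # failed health checks against its NLB target. All names here are invented.
    import time

    RECORD = "dynamodb.us-east-1.example"      # placeholder record name
    FAILURES_BEFORE_WITHDRAWAL = 3

    dns_table = {RECORD: "203.0.113.10"}       # stand-in for an authoritative zone

    def check_nlb_health() -> bool:
        """Placeholder for a real probe against the NLB target."""
        return True

    consecutive_failures = 0
    while True:
        if check_nlb_health():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            # The risky part: if the monitor itself is broken or flapping,
            # it can withdraw a record for a service that is actually fine.
            if consecutive_failures >= FAILURES_BEFORE_WITHDRAWAL:
                dns_table.pop(RECORD, None)    # record vanishes for everything downstream
        time.sleep(30)

The point of the sketch is the failure mode: once record cleanup is automated off a health signal, a confused monitor can take the record out even though the service behind it is up.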