
> In short, a latent bug in a service underpinning our bot mitigation capability started to crash after a routine configuration change we made. That cascaded into a broad degradation to our network and other services. This was not an attack.

From the CTO. Source: https://x.com/dok2001/status/1990791419653484646


It still astounds me that the big dogs do not phase config rollouts. Code is data, configs are data; they are one and the same. It was the same issue with the giant Crowdstrike outage last year: they were rawdogging configs globally, and a bad config made it out there and everything went kaboom.

You NEED to phase config rollouts like you phase code rollouts.
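
Even a rough version of this gets you most of the way there. A minimal sketch of what I mean (push_config and health_check are hypothetical helpers here, not anyone's actual deploy tooling):

    import time

    # Hypothetical stages: canary first, then progressively wider slices.
    STAGES = [0.01, 0.05, 0.25, 1.00]

    def rollout(config, fleet, push_config, health_check):
        done = 0
        for fraction in STAGES:
            target = int(len(fleet) * fraction)
            for host in fleet[done:target]:
                push_config(host, config)
            done = target
            time.sleep(300)  # bake time before widening the blast radius
            if not all(health_check(h) for h in fleet[:done]):
                # Bad config: halt and restore instead of letting it
                # propagate to the rest of the fleet.
                for h in fleet[:done]:
                    push_config(h, "last-known-good")
                raise RuntimeError("rollout halted at %d%%" % int(fraction * 100))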


The big dogs absolutely do phase config rollouts as a general rule.

There are still two weaknesses:

1) Some configs are inherently global and cannot be phased. There's only one place to set them. E.g. if you run a webapp, this would be the configs for the load balancer, as opposed to the configs for each webserver.

2) Some configs have a cascading effect -- even though a config is applied to 1% of servers, it affects the other servers they interact with, and a bad change spreads across the entire network.


> Some configs are inherently global and cannot be phased

This is also why "it is always DNS". It's not that DNS itself is particularly unreliable, but rather that it is the one area where you can really screw up a whole system by running a single command, even if everything else is insanely redundant.


I don't believe there is anything that necessarily requires DNS configs to be global.

You can shard your service behind multiple names:

my-service-1.example.com

my-service-2.example.com

my-service-3.example.com …

Then you can create smoke tests that hit each phase of the DNS rollout, and if you start getting errors you stop the rollout of the service.
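
Concretely, something like this (stdlib-only sketch; the shard names and the /healthz path are made up for illustration):

    import socket, urllib.request

    SHARDS = ["my-service-%d.example.com" % i for i in range(1, 4)]

    def smoke_test(host):
        try:
            socket.gethostbyname(host)  # does the new record resolve at all?
            resp = urllib.request.urlopen("https://%s/healthz" % host, timeout=5)
            return resp.status == 200
        except OSError:
            return False

    for shard in SHARDS:  # roll the DNS change shard by shard
        if not smoke_test(shard):
            print("stopping rollout, %s failed its smoke test" % shard)
            break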


Sure, but that doesn't really help for user-facing services where people expect to either type a domain name in their browser or click on a search result, and end up on your website every time.

And the access controls of DNS services are often (but not always) not fine-grained enough to actually prevent someone from ignoring the procedure and changing every single subdomain at once.


> Sure, but that doesn't really help for user-facing services where people expect to either type a domain name in their browser or click on a search result, and end up on your website every time.

It does help. For example, at my company we have two public endpoints:

company-staging.com

company.com

We roll out changes to company-staging.com first and have smoke tests which hit that endpoint. If the smoketests fail we stop the rollout to company.com.

Users hit company.com.


That doesn’t help with rolling out updates to the DNS for company.com which is the point here. It’s always DNS because your pre-production smoke tests can’t test your production DNS configuration.


If I'm understanding it right, the idea is that the DNS configuration for company-staging.com is identical to that for company.com: same IPs and servers, same DNS provider, same domain registrar. Literally the only difference is s/company/company-staging/; all accesses should hit the same server with the same request other than the Host header.

Then you can update the DNS configuration for company-staging.com, and if that doesn't break, there's very little scope for the update to company.com to go differently.
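
And that assumption is cheap to verify mechanically before touching company.com (a sketch, assuming the two names really should resolve to identical addresses):

    import socket

    def resolve_all(name):
        # Collect every A/AAAA answer for the name, order-insensitive.
        return sorted({info[4][0] for info in socket.getaddrinfo(name, 443)})

    staging = resolve_all("company-staging.com")
    prod = resolve_all("company.com")
    if staging != prod:
        print("zones have drifted:", staging, "vs", prod)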


The purpose of a staged rollout is to test things with some percentage of actual real-world production traffic, after having already thoroughly tested things in a private staging environment. Your staging URL doesn't have that. Unless the public happens to know about it.

The scope for it to go wrong lies in the differences between the real world and the simulation.

It's a good thing to have, but not a replacement for the concept of staged rollout.


But users are going to example.com. Not my-service-33.example.com.

So if you've got some configuration that has a problem that only appears at the root-level domain, no amount of subdomain testing is going to catch it.


I think it's uncharitable to jump to the conclusion that just because there was a config-based outage they don't do phased config rollouts. And even more uncharitable to compare them to crowdstrike.


I have read several Cloudflare postmortems and my confidence in their systems is pretty low. They used to run their entire control plane out of a single datacenter, which is amateur hour for a tech company with over $60 billion in market cap.

I also don't understand how it is uncharitable to compare them to Crowdstrike, as both companies run critical systems that affect a large number of people's lives, and both companies seem to have outages at a similar rate (if anything, Cloudflare breaks more often than Crowdstrike).


https://blog.cloudflare.com/18-november-2025-outage/

> The larger-than-expected feature file was then propagated to all the machines that make up our network

> As a result, every five minutes there was a chance of either a good or a bad set of configuration files being generated and rapidly propagated across the network.

I was right. Global config rollout with bad data. Basically the same failure mode as Crowdstrike.


It seems fairly logical to me? If a config change causes services to crash, then the rollout stops … at least in every phased rollout system I've ever built…


In a company I am no longer with, I argued much the same when we rolled out "global CI/CD" on IaC. You made one change, committed and pushed, and wham, it was on 40+ server clusters globally. I hated it. The principal was enamored with it, "cattle not pets" and all that, but the result was that things slowed down considerably because anyone working with it became terrified of making big changes.


Then you get customer-visible delays.


Because adversaries adapt quickly, they have a system that deploys their counter-adversary bits quickly without phasing - no matter whether they call them code or configs. See also: Crowdstrike.


You can't protect against _latent bugs_ with phased rollouts.


Wish this could rocket to the top of the comment thread; digging through hundreds of comments speculating about a cyberattack to find this felt silly.


Configuration changes are dangerous for CF, it seems, and knocked $NET down almost 4% today. I wonder what the industry-wide impact is for each of these outages?


Pre-market was red for all tech stocks today, before the outage even happened.


Yes, if anything it's bullish on CloudFlare because many investors don't realize how pervasive it is.


> Configuration changes are dangerous for CF, it seems, and knocked $NET down almost 4% today. I wonder what the industry-wide impact is for each of these outages?

This is becoming the "new normal." It seems like every few months, there's another "outage" that takes down vast swathes of internet properties, since they're all dependent on a few platforms and those platforms are, clearly, poorly run.

This isn't rocket surgery here. Strong change management, QA processes and active business continuity planning/infrastructure would likely have caught this (or not), as is clear from other large platforms that we don't even think about because outages are so rare.

Like airline reservations systems[0], credit card authorization systems from VISA/MasterCard, American Express, etc.

Those systems (and others) have outages in the "once a decade" or even much, much longer ranges. Are the folks over at SABRE and American Express that much smarter and better than Cloudflare/AWS/Google Cloud/etc.? No. Not even close. What they are is careful, as they know their business is dependent on making sure their customers can use their services anytime/anywhere, without issue.

It amazes me the level of "Stockholm Syndrome"[1] on display in this thread, with many expressing relief that it wasn't "an attack" and essentially blaming themselves for not having the right tools (API keys, etc.) to recover from the gross incompetence of, this time at least, Cloudflare.

I don't doubt that I'll get lots of push back from folks claiming, "it's hard to do things at scale," and/or "there are way too many moving parts," and the like.

Other organizations, like the ones I mention above, don't screw their customers every 4-6 months with (clearly) insufficiently tested configuration and infrastructure changes.

Yet many here seem to think that's fine, even though such outages are often crushing to their businesses. But if the customers of these huge providers don't demand better, they'll only get worse. And that's not (at least in my experience) a very deep or profound idea.

[0] https://en.wikipedia.org/wiki/Airline_reservations_system

[1] https://en.wikipedia.org/wiki/Stockholm_syndrome


just curious, why not the Sonnet models? In my personal experience, Anthropic's Sonnet models are the best when it comes to things like this!


yes, had to use reader mode.


Ah, that might be a bug, sorry: the sample notes should also have the .html extension. The notes themselves are currently stored as HTML (mainly because of the metadata and the state in notes); you can still export to Markdown.

We need to do some work before we can store them simply as Markdown files.


Thank you.

Yes, we support any endpoint that supports the completions API. And yes, Ollama might be the easiest to set up. The images should also work with qwen3-vl.
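
For anyone wondering what "any endpoint that supports the completions API" means in practice, a minimal sketch against Ollama's OpenAI-compatible API (assumes Ollama's default local port and a qwen3-vl model already pulled):

    import json, urllib.request

    # Ollama serves an OpenAI-compatible API under /v1 by default.
    req = urllib.request.Request(
        "http://localhost:11434/v1/chat/completions",
        data=json.dumps({
            "model": "qwen3-vl",
            "messages": [{"role": "user", "content": "Summarize this note."}],
        }).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])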

But if you run into any issues, please feel free to submit a bug report https://github.com/deta/surf/issues

Edit: fixed github issues link


Please make yourself aware of all the facts before posting ignorant comments like this. Children died, shot while protesting against the blatant corruption and lack of accountability that has been going on for decades. The social media ban was just the final straw.

https://www.aljazeera.com/news/2025/9/8/six-killed-in-nepal-...


Great work, I can imagine how interesting the migration was. Why are you using Redis, if I may ask? Was something like SQLite not enough? What are your biggest challenges with Tauri?

I also work on an Electron app where we also do local embeddings, and most of the CPU-intensive work happens in Node.js addons written in Rust using Neon (https://neon-rs.dev, very grateful for this lib). This is a nice balance for us.


Thanks! We went with Redis because we weren't able to tune SQLite with a vector search extension to give us the results we wanted. I'm sure it's possible to use it instead of Redis, but that's an optimization for another day. I'll check out Neon.


Sad to see; Outer Wilds is probably the best game I've ever played. But I'm sure the staff will continue doing great work in whatever they decide to do next.


This is the publisher, not the developer, of Outer Wilds imploding. The people who actually made the game are still working at the studio.


just curious, why is this the case?


I am not against AI in software per se (I like the ChatGPT desktop app), but I prefer not to have AI in tools where the primary use case is not related to AI. Specifically, I don't want to have to consider how the AI is being developed and integrated (e.g., whether my data is being used for training an LLM), and, in general, I prefer for my software to be a bit conservative in relation to new trends (be it anything from crypto to AI).


this is brilliant, the questions are hilarious, didn't beat the game yet

