We are not using Cloudflare, but our domain is also not accessible. We are using DigitalOcean's DNS service to publish our IP. Does DigitalOcean's DNS service depend on Cloudflare?
An SLA of 100% simply means you agree to compensate your customers (as specified, usually with credit) if your service is down at all, nothing more.
Also, SRE here but not for Cloudflare -- I've never seen SREs directly involved in externally published SLAs; they usually come from legal. We deal with SLOs on more fine-grained SLIs than overall uptime.
SRE - Site Reliability Engineer (a term Google came up with that's been adopted elsewhere). Google defined it approximately as what happens when you apply software engineering practices to what was traditionally an operations function.
SLO - Service Level Objective - the service level you strive for. If your actual service level is higher than the objective, you have room for experimentation, etc.
SLI - Service Level Indicator - the actual metric(s) you use to measure a service level (latency, error rate, throughput, etc.)
SLA - correct. That’s the contract between the operator and the users which describes the penalties for not meeting the agreed-upon SLO.
SLO - service level objective, the stated availability (or latency or durability, etc.) of the service. Usually expressed as a value over a period of time (e.g. 99.9% availability as measured over a moving 30-day average). The SLO is measured by the SLI.
SLI - service level indicator. Simply, the direct measurement of the service (i.e. metrics).
SRE - Site Reliability Engineer, usually a member of a team who is responsible for the continued availability of the service and the poor sap who gets paged when it breaches its SLO or has an outage or other impactful event.
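To put numbers on the SLO definitions above, here's a minimal back-of-envelope sketch in Python; the 30-day window and the target values are just example figures, not anything from the thread:

    # Rough error-budget math for an availability SLO (example numbers only).
    def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
        """Minutes of downtime permitted by an availability SLO over a window."""
        return (1.0 - slo) * window_days * 24 * 60

    for slo in (0.999, 0.9999, 1.0):
        print(f"{slo:.2%} over 30 days -> {allowed_downtime_minutes(slo):.1f} min of budget")

    # 99.90% -> 43.2 min, 99.99% -> 4.3 min, 100% -> 0 min.
    # A 100% target leaves no error budget at all, which is why it only makes
    # sense as a compensation clause, not as an engineering objective.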
I'm not sure you and your parent understand what an SLA means. It's an agreement that, when broken, incurs a penalty.
They aren't saying they guarantee 100% uptime. They're saying they'll pay you for any downtime. It's literally the 3rd paragraph:
> 1.2 Penalties. If the Service fails to meet the above service level, the Customer will receive a credit equal to the result of the Service Credit calculation in Section 6 of this SLA.
(Most people I know consider them meaningless marketing BS that's really just meant to trick people or satisfy some make-work checkbox)
> Cloudflare ("Company") commits to provide a level of service for Business Customers demonstrating: [...] 100% Uptime. The Service will serve Customer Content 100% of the time without qualification.
This is a legal commitment to provide 100% uptime. They are guaranteeing 100% uptime and defining penalties for failing to meet that guarantee. The fact that a penalty is defined does not stop it from being a guarantee.
No, this SLA is a legal commitment to give you credits when Service uptime falls below a certain threshold. The threshold could be anything - 99%, 50%, 100%, etc. Importantly, Cloudflare is not under a legal obligation to provide the Service at or above the agreed threshold, it's under a legal obligation to give you Credits when the Service uptime is below that threshold.
"Service Credits are Customer’s sole and exclusive remedy for any violation of this SLA."
> This is a legal commitment to provide 100% uptime. They are guaranteeing 100% uptime
I don't think you know what a guarantee is.
For example when you buy a new car you get a guarantee that it won't break down. Are they claiming it won't break down? No, of course not. What a guarantee means is that they'll fix it or compensate you if it does.
Looks like it supports the parent's opinion:
commit - bind to a certain course or policy. It's a legal obligation, not a statement about guarantees in the physical world (like "this alloy won't melt below t°C").
I can completely understand your frustration. But even the top CDNs can have outages of some form or other. If site uptime is important, check out https://www.cdnreserve.com/ - it's built on the design principle that the likelihood of two separate platforms having an outage at the same time is close to zero.
Cloudflare going down is one of the things which keeps me awake. My main complaint about Cloudflare is that they are so good at everything they offer that we've become reliant on them for everything.
Exactly, but the likelihood of two networks going down at the same time is close to zero. Check out https://www.cdnreserve.com/ - we rolled it out to complement the top CDNs.
True. They're usually due to issues with BGP routes.
It's common to see CF being the DNS/CDN for applications across AWS, GCP, Azure etc. So perhaps CF being down affects more applications than individual cloud platforms?
Yeah, what's up with the competition to Cloudflare? What's the real barrier to entry?
It's not infrastructure anymore, as there is a new PaaS startup every week offering distributed hosting. So why is bundling in DNS, DDoS detection+mitigation, cloud workers... with it so hard?
This is just my take, but Cloudflare looks to be building a "moat" to make entry hard. This is built around two things: 1. economies of scale, 2. a network effect.
As Cloudflare gets bigger, they can provide services more cheaply. This is because (a) they can more fully utilise their data centres and other physical capital investments, (b) they can divide their fixed software costs over more users and (c) they get process efficiencies and discounts with scale.
A new entrant will struggle to match cost unless they're able to obtain similar scale. The bigger Cloudflare gets, the bigger the scale that a new entrant needs to hit before they can match them on cost.
Second, they're aiming to build a network effect through having a huge number of locations. The more locations, the more appealing to new customers, as they can be close to more users. A competitor will have to build a similar number of locations to match Cloudflare's proposition.
A new entrant cannot provide as much value, and therefore cannot charge as high a price, without building a similar sized network. This again requires the entrant to invest heavily before they can charge a similar price.
-
The combination of these two things means that when Cloudflare is operating at a large scale with a large network it can offer a more valuable service (and charge a higher price) than a new entrant, and earn more profit because it can operate at a lower cost.
Also, Cloudflare has the option of lowering its price and still being profitable due to lower costs at its scale, so it can deter entrants from trying to compete by the threat of being able to lower prices below what is profitable for new entrants.
The only players who can compete may be those who already have comparable size - Amazon, Google, Microsoft, Facebook, CDNs, etc, since they will already have addressed the issues of scale and network effects. However, they may not want to cannibalise their existing markets. It will be hard for other new entrants to compete.
There are many noteworthy players - Akamai, Fastly etc., and edge providers like ourselves (Zycada) who complement top CDNs like Akamai, Cloudflare, Fastly.
The main difference between Cloudflare and the others mentioned is the price: one can start with CF for a side project for free and continue to use it for free until it becomes a viable startup.
Others at best offer a limited trial plan, but most are just 'Speak to an expert / Contact us' for pricing, which means haggling with a sales rep when we could just be building things. Even CF's paid plans are reasonable when compared with others, and with better features.
You can't build a Cloudflare competitor in AWS/Azure/Linode/DO/etc. You need your own data centers. Multiple of them across the country, ideally around the world if you want to serve the whole world.
Thanks for the update. Just curious if we will get a report on what happened? In as much detail as can be shared, of course - morbid curiosity mainly. I love the post-mortem reports these events usually bring.
Sites are gradually reappearing as I type this. Some of my sites, and doordash.com, were returning 500 errors again just a minute ago. They just came back up, followed by the CF dashboard loading again.
DR means "disaster recovery"; it's a formal plan used to respond to and mitigate potential risks to the business. Things like having a communications plan for an incident, or a backup office outside of your main office's natural disaster zone.
I really dislike that they are editing their status messages.
Entry[1] dated "Jun 21, 2022 - 06:43 UTC" has been edited to include more detail after they posted another entry at 06:57 UTC. There seems to be no indication that the message has been altered.
Currently the text on the status page may suggest that they identified the problem immediately, but it actually took about 15 minutes. Previously there was text stating that customers should expect an update within 15 minutes. The next message was posted 14 minutes after that, but the previous message was altered later and nothing indicates this.
Strongly agree. Such whitewashing puts all previous incident reports in doubt - can I trust CF summaries of outages, or did they rewrite that history too?
I understand your point, but Cloudflare generally is very transparent, including root cause analyses and their CTO reaching out directly. It could also be a mistake or something not well thought through, rather than bad intentions.
edit: Not that I necessarily agree with the article even in light of there being an outage, cloudflare has been pretty good for us. Just thought it was interesting.
In the mid-2000s a computer science professor said that internet capacity was not going to match the amount of traffic. Everybody laughed. The world was full of dark fiber after the dot-com bust.
But if you look at his math, it was correct. The era of the heterogeneous, distributed, client-server Internet is just a sideshow today.
The solution has been centralization (clarification: big companies run their own caches and networks near users) and the growth of caches, with Cloudflare taking care of the rest.
> The solution has been centralization and growth of caches.
Centralization and growth of caches are on their face contradictory.
Perhaps you mean organizational centralization but that really has nothing to do with internet capacity demands. Your hot take isn’t so brilliant. What’s fundamentally wrong with edge distribution?
Yes. This is exactly what I mean. Big companies run their own caches and networks near users. Cloudflare takes care of the rest.
>What’s fundamentally wrong with edge distribution?
You incorrectly assume judgement on my part. My point is that things have changed. New problems arise from solutions to old problems: fragility from a small number of organizations running their caches to solve the bandwidth problem.
But web2 is going (really) great once you depend on Cloudflare. And it is certainly not re-centralizing the whole internet with a provider that is a single point of failure. /s
So yeah, how is a CDN centralizing your infra? You could just have your CNAMEs point to a different provider or directly to your gateways. Or you could even go down the multi CDN path, and have someone like ns1 automatically redirect your CNAMEs to an alternate CDN on a per-geo basis to overcome local failures.
It's just another SaaS component in your system. You could self host if you're willing to take on the ownership challenge, and at certain scale it would even be more cost effective.
Not the OP, but CloudFlare is not only a CDN but does everything you mentioned in your comment for you, so it's the load balancer and the DNS as well. When it goes down everything goes down.
Technically you could set up a separate DNS/failover somewhere else and use a backup reverse proxy/TLS terminator/CDN SaaS similar to CloudFlare, but then that somewhere else will be your point of failure.
It's time to start discussing a fail-open option for us CF users. Most of my sites are using CF for global performance rather than DDoS protection and security. I'd be fine with them changing DNS to point to the origin (or any other user defined IPs) in case of issues (even if it would take hours to return to normal).
This is also important for countries with limited connectivity to the Internet. If the PoP in that country loses its connection back to CF it shuts everything down, so even if the origin is in the next rack over from the PoP, it's unreachable.
You can implement your own DNS server that CNAMEs to Cloudflare and falls back to origin IP when there is a problem with Cloudflare.
I think a downstream Cloudflare provider could provide such services if they desire.
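A minimal sketch of the decision logic behind that approach, assuming a health-check URL, a Cloudflare CNAME target and an origin IP that are all placeholders; the actual wiring into an authoritative DNS server (dnslib, CoreDNS, whatever you like) is left out:

    # Failover decision you'd plug into your own authoritative DNS server.
    # Hostnames and IPs below are placeholders, not real values.
    import requests

    CF_TARGET = "www.example.com.cdn.cloudflare.net."  # CNAME target when CF is healthy
    ORIGIN_IP = "203.0.113.10"                          # direct origin A record

    def cloudflare_healthy(probe_url: str = "https://www.example.com/healthz") -> bool:
        """Probe the site through Cloudflare; a 5xx or timeout counts as down."""
        try:
            return requests.get(probe_url, timeout=3).status_code < 500
        except requests.RequestException:
            return False

    def answer_for(qname: str) -> tuple[str, str]:
        """Return (record_type, value) to serve for the proxied hostname."""
        if cloudflare_healthy():
            return ("CNAME", CF_TARGET)
        return ("A", ORIGIN_IP)  # fail open to the origin; keep the TTL short

Short TTLs are the main design choice here: the fallback only helps if resolvers re-ask you quickly after Cloudflare goes down.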
My company was paying $20 a month. We were heavily dependent on CF; we'd have been happy to pay more.
But... the one feature we wanted was for our accounts team to have their own login so the ops team didn't have to download invoices every month. Nope, that one feature required an enterprise plan which they quoted $4,000 a month for.
Companies where you have to log in and download invoices are the worst. If there's a viable alternative to their products I switch immediately. You make it seem like it's not a big deal, but a reasonably sized startup has dozens of service providers. Should we pay every little service $4k/mo just to save the communications and context switching overhead?
You jest, but imagine how time consuming it would be if every app we used was set up like CloudFlare, where only the one super admin can deal with billing.
Also, in these days of remote work, it's a problem if the credit card details need updating - either you have to give the company card details over a Slack call, or you need to give a card holder your root password.
Imagine already paying for a service and then having someone snark at you for wanting things for free.
I tried to exercise some restraint this time, but screw it. Here's another rant:
Beware of Cloudflare's tactic of luring people in to their CDN product with "free" bandwidth, and then locking useful features arbitrarily behind what I can only imagine is a thousands of dollars per month enterprise plan. Just look at their cache-purging page for a super obvious example of this (there are plenty more, way too many to list), everything other than basic purge by URL is enterprise only: https://developers.cloudflare.com/cache/how-to/purge-cache/
These days Cloudflare is literally my last choice for a CDN for my new projects. My new go-to is bunny.net, who charges a reasonable usage-based fee for bandwidth and gives you unfettered access to all the features they've built (and doesn't route your users to farther/closer nodes based on how much you pay: https://cloudflare-test.judge.sh/). Though I'd even reach for Cloudfront with their expensive bandwidth costs these days, because at least their pricing is transparent and scales smoothly with usage, and they don't arbitrarily cut you off from useful features that you might not know you need yet.
Even their bandwidth might not really be "free", since I've heard if you actually use any significant amount, the sales people will come knocking on your door to coerce you to get on the same enterprise plan or have your site taken down.
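For reference, the purge-by-URL call that's left on the free side of that page is a single POST to the zone's purge_cache endpoint - at least as far as I remember the API, so check the current docs. The token and zone ID below are placeholders:

    # Purge-by-URL via the Cloudflare API (verify against the current docs;
    # zone ID and token are placeholders).
    import requests

    ZONE_ID = "your-zone-id"
    API_TOKEN = "your-api-token"

    resp = requests.post(
        f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/purge_cache",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={"files": ["https://example.com/static/app.js"]},
        timeout=10,
    )
    print(resp.json())  # expect {"success": true, ...} on a valid purge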
Can I ask out of interest (most of my projects are high perf/low traffic) what kind of traffic you are dealing with at the point you decide you need a CDN?
I don't really use a CDN to manage high traffic volumes. It's more to provide a better, lower-latency experience for my users regardless of where they access my apps from.
You’d need to have TLS certs on the origin ready to go for this scenario to work. Additionally, you’d need to test it and make sure nothing goes wrong when this event actually happens.
On top of that, depending on your scale, can you take all the traffic on origin that Cloudflare currently offloads?
No issue for me. This is obviously a power-user option. It's kind-of implemented for Enterprise users, where you don't have to let CF have full control over the domain.
Probably not many users who need the performance and can handle unexpected failover. There would also be the issue of setting the policy defaults effectively. Most users wouldn’t benefit from this footgun.
If you’re serious, you could probably automate this right now with your DNS provider and uptime monitoring.
I'm so serious that I already have failover after 1 hour at the registrar level, but those changes are not immediate and can take up to 24h to roll in and roll back due to DNS propagation and caching.
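If it helps, the monitoring side of that can be a tiny cron job. The DNS API endpoint below is entirely hypothetical (every registrar/provider differs), and as noted above, low TTLs shorten but don't eliminate the propagation lag:

    # Hypothetical failover cron job: if the site is unreachable through the CDN,
    # point the A record at the origin via the DNS provider's API. The endpoint
    # and payload are made up -- substitute your provider's real API.
    import requests

    SITE_URL = "https://example.com/healthz"
    ORIGIN_IP = "203.0.113.10"
    DNS_API = "https://api.example-dns-provider.com/v1/zones/example.com/records/A/@"

    def site_up() -> bool:
        try:
            return requests.get(SITE_URL, timeout=5).status_code < 500
        except requests.RequestException:
            return False

    if not site_up():
        requests.put(DNS_API, json={"content": ORIGIN_IP, "ttl": 60}, timeout=10)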
I think there is no nesting maximum (or if there is, it's much bigger than this). There's a limit which stops you replying to a comment immediately, to prevent super long quick-fire arguments.
Ah, sorry, misunderstood you. You can’t rely on them to change their host records when they’re down.
If you want CDN-independent automatic failover, look into anycast with two providers. If one of them is Cloudflare, use the tier that lets you manage your DNS elsewhere.
There is no immediate option with DNS changes. CF can’t immediately remove their IP from the route. Sounds like you’ve solved your problem in the sense that you have an automatic failover, though, which is good.
You say that, but there's tons of automated attempts doing the rounds on everything directly connected to the internet; centralized providers like Cloudflare can detect and prevent these patterns, whereas you need to be on the ball yourself if you have a service directly open to the internet. Exploits are exploited quickly, and while I make no assumptions about your particular website / application, a lot cannot push an update on short notice.
That would leak your IP nevertheless. People can figure out that you're serving a specific website by inspecting the certificate you present during the TLS handshake, without ever sending a request.
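Easy to check for yourself - a bare TLS handshake against the IP returns whatever certificate the box presents, and no HTTP request is ever sent. A rough standard-library sketch (the IP is a placeholder):

    # Grab whatever certificate a bare IP presents during the TLS handshake --
    # no HTTP request is made. The IP is a placeholder.
    import ssl

    pem = ssl.get_server_certificate(("203.0.113.10", 443))
    print(pem)
    # Feed the PEM to `openssl x509 -noout -text` to read the CN/SANs and see
    # which hostnames the box is actually serving.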
Wow lots of websites are affected, including Medium. The perils of centralization strike again. Though ironically, I noticed that the IPFS website uses cloudflare as well. The actual IPFS network is working just fine though, and I'm not aware of IPFS ever having any global outages. Though then again, I'm not aware of any on bittorrent either
The concept of "being down" doesn't really apply to protocols. IPFS/BitTorrent never being down is a bit like saying that TCP/HTTP has never been down. Individual servers/clients can have connection issues, but that obviously won't affect clients not connected to them, and it's not because of the protocols themselves.
Not to state the obvious, but... if a big centralized company built a Cloudflare for IPFS to make it easy for the masses to adopt, that company could go down just as easily as Cloudflare.
How so? Somebody links to a webpage, decentralized resolver converts it to an IPFS hash, which the client queries for any providers of that hash, and retrieves directly from them. No central authority necessary
jgrahamc, just some feedback about trying to reach support:
1. I could see my site down, including cloudflare.com with nginx 500 errors, via Sydney AU
2. Logged in to dashboard (via Melbourne AU) that worked; and so was thinking it was an issue with Sydney Cloudflare
My experience with Cloudflare has been in the past sometimes servers in some regions have issues and its a transient thing.
3. Status page showed no problems, so I went to "Contact support" and went around in circles (really frustrating) via the "Contact support" link moving me between Community forums, Support ticket, etc. I then see Chat is an option available with a Business plan, so I upgrade to that, hoping for some real-time support to alert them of the Sydney issue.
4. Return to the "Contact support" page after upgrading the plan, but the Chat option still not present on the support screen (and help articles say to return to support page and click "Chat" but it never shows up).
5. Come across https://community.cloudflare.com/t/cloudflare-for-teams-chat... while searching for why I can't see Chat as an option; people on that thread say they're on paid plans and chat support isn't showing up for them either, so I just give up, assuming it's broken
6. Open HackerNews and see it's at the top. A few moments later the status page reflects the outage.
I still can't see the Chat option so I've down-graded the plan again.
Their whole support experience is really not great. I've used it a few times over the last few years, and I rarely came away satisfied.
For example, they seem to have what I assume is a separate DB for CF users and CF support users, but with one shared login system. If you update your email on CF, it's not reflected in their support system and all your tickets will be refused because of the email mismatch, completely disregarding the fact that you just logged in via your CF account. And there's no way to update it from the support side, of course.
At times like this and the big Fastly outage roughly a year ago, choosing to host on a simple, independent bare-metal box doesn't seem like such a bad strategy (as long as one has backups for disaster recovery, of course). Sure, other things can cause downtime in that kind of infrastructure, but at least my service isn't likely to be taken offline by someone else's configuration error or deployment gone wrong.
I have been running my business on Hetzner bare-metal servers for the last 7 years. During that time there were several brief network outages, on the order of minutes. I think one network outage was 30 minutes. Other than that, no problems.
Given the price and performance difference between bare-metal and everything else, I am puzzled as to why small businesses that do not need scalability do not go with bare metal. And given the speeds of today's hardware, if you are not doing something stupid and you have a B2B SaaS, it's really difficult to need "scalability" beyond several bare-metal servers.
To be clear, I do not consider my bare-metal boxes "reliable", I have a multi-server setup managed by ansible, with a distributed database, and I can take a single-node failure without problems. I also have a staging setup that can be converted to production quickly, and a terraform setup that can quickly spin up a Digital Ocean cluster if needed.
Your box running your web server is far less complicated than using a CDN and worrying about countless additional points of failure. Network problems are only a minor risk.
My Internet goes down at least twice a year and my electricity goes down even more, especially in the winter. So no, this is not more reliable than Cloudflare.
In a discussion about using a CDN, it's implicit that it represents an addition to "professional" hosting with servers in a well managed data center that has, at least, redundant high-bandwidth network connections, not to a domestic network connection.
Note that your home network could be good enough for a personal web site that nobody pays you to respect an SLA on.
No, we're talking about a colocation provider, or a leased dedicated server provider. I went with OVHcloud US for my latest deployment. HN is at m5hosting.com.
You seem to imply that the options are only cloudflare or your apartment. This simply isn't true: there are a plethora of companies that will lease you a dedicated box of some Us in one of their racks, as the sibling commenter replies. Alternatively, you can search for co-location services. Options range from 1U/2U co-location, to half rack units, to full racks, to dedicated areas of the datacentre ranging from cages to whole rooms (I've been in at least one datacentre where an entire room was under separate access control and leased to one customer only).
Usually datacentres are located quite strategically. For example the location of many datacentres in Zürich corresponds with two separate power supply grids that meet (so they can pull from both).
Some of the companies involved are resellers and don't actually operate the datacentres they use. Others actually do. Usually the service is more or less the same, from the point of view of renting a 1U, or co-locating one.
If you want reliability features of a datacentre, e.g. for your office services, but might move, you may find your local city surprising. In Manchester, UK, there's a large amount of dark fibre under the city (fibre that is laid, but not in use), owned by some of the DC companies. Sometimes you can connect your office to said datacentre via dedicated fibre.
We’ve been on Hetzner for several years now. So far the only outages we had were from us moving servers (yeah, we don’t have high availability or load balancing, just a single beefy dedicated server). So, yes?
Last company I worked for, we had many Hetzner servers. We had many drive failures and CPU fan failures. It's fine if you can deal with a relatively high chance of hardware failure.
Perhaps not, but those who want to avoid Cloudflare for technical or ideological reasons won't realistically expect identical performance from smaller alternatives. Same as using Linux. People use it knowing full well it may not support the latest & greatest consumer gadgets like Windows does, but unless people use alternatives despite minor downsides, we shouldn't be distressed when we eventually reach a point of global near-monopoly.
I guess it depends. If you scale up and down via the API and can’t access the API .. you have a pretty good chance of a down scenario if you had a traffic spike you can’t scale for.
Yeah, they'd also be dependent on their ISP still if they're "fully independent". Good luck dealing with massive traffic spikes on a single bare-metal box and good luck maintaining a similar uptime to cloudflare's 98.84% uptime lol
Most (or at least many) colo facilities have multiple transit ISPs, some are big enough to have decent peering as well.
I'm assuming 98.84% uptime is a joke? That works out to over 4 days of downtime a year, and staying under that is something I could manage from a home connection most years, if I had a static IP.
Interestingly enough, I'm already logged in, and the homepage as well as the rest of the Linode dashboard are operational. It seems only the login page is down.
Today's actually the first time my site is down and it's Cloudflare's fault instead of my own. Obviously this outage is huge, but so far I've been really impressed with their reliability.
What is your setup where you are isolated from "another person making a mistake"? Even if you're running a box in a colocation datacenter you're still able to get knocked off the net by some maintenance on the surrounding pipes. Hell, hosting your own box doesn't stop Comcast DNS issues from knocking a bunch of people off either.
I do think there's room for some holistic overview of hosting stuff on the internet, where you could label each extra actor that can break things, the mitigation strategies, and the costs of each. Someone better than me would be able to place relative risk (and I think laying out various providers' uptimes/issues in that model would be great!) and offer a smart way of dealing with the buy vs. build question on this.
If you need fault tolerance/isolation, you want to have a second box in a different colo (preferably in a different city; a different coast/continent if it's important).
If you can live with dns round robin between the two, then you can easily host the DNS with multiple providers and avoid SPOF (could maybe host it on the two boxes you already have, too). You're still at risk of domain registry/registrar failures, and failures of their tld nameservers (very rare for well run tlds) and the root servers (not sure if they ever had a widespread failure). And of course, simultaneous failure of both locations isn't impossible, just less likely.
On Comcast DNS failures... Most of the recent ones I've heard of manifested as users on Comcast can't resolve X, but really X had bad DNSSEC records and Comcast's DNS refused to return records that weren't signed properly. It's easy to avoid that by not using DNSSEC.
In the general case of working despite bad ISP dns, you can't do much (anything?) for web browsers, but if you build apps, you can hard code fallback IPs for when DNS doesn't work... But you need to have IPs that stick around for the lifetime of your app downloads.
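A minimal sketch of that hardcoded-fallback idea (the hostname and pinned IPs are placeholders): resolve normally, and only reach for the pinned addresses when the resolver itself fails.

    # DNS-with-fallback helper for an app: use normal resolution when it works,
    # otherwise fall back to hardcoded IPs that must outlive the app's install base.
    import socket

    FALLBACK_IPS = ["203.0.113.10", "203.0.113.11"]  # placeholder long-lived IPs

    def resolve(host: str, port: int = 443) -> list[str]:
        try:
            infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
            return [info[4][0] for info in infos]
        except socket.gaierror:
            return FALLBACK_IPS  # resolver is down or broken; use the pinned IPs

    print(resolve("api.example.com"))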
Fair point. Still, based on my anecdotal experience using leased dedicated servers, mistakes at that networking layer seem to happen less often than mistakes that take AWS us-east-1 or one of the big CDNs offline.
It does feel like more "hosted" environments are trying to do more fancy stuff inside the network, so have more failure cases. Or perhaps services that do a lot of things, even if you end up just using simple server components.
I still have a fun memory of half of IBM Cloud's servers falling over, meaning that our production app was luckily still up but our staging server fell over. I could get to their website, but their login stuff was all messed up. I believe that one was also a "routing stuff got messed up" issue....
Puhleeze. DNSimple uses Cloudflare's DNS firewall product - this is not a secret. If you don’t like it, use an alternative DNS provider; there are plenty.
Yeah, but if the internet is widely down, the network effect is that people probably aren't using your site anyway, because everything else is down and they'll wait for confirmation from sites like Facebook, their internet banking and Netflix to make sure things are back to normal.
> The internet is an interconnected web of dependencies.
Ironically this is exactly what increasing centralisation weakens. The huge cloud providers have eroded "an interconnected web of dependencies" into a few huge server farms servicing everyone else.
> Their status page is a joke, likely crippled to reduce legal liability, but at this point it's just an outright misrepresentation.
It's just Atlassian Statuspage, which is a manually-updated incident response system. Unlike AWS, Cloudflare actually makes an effort to update it fairly quickly, but it can still be slow-to-update when something is immediately wrong.
> Their status page is a joke, likely crippled to reduce legal liability, but at this point it's just an outright misrepresentation.
It's fairly standard practice these days for status pages to be manually updated. The difficulty with having them be automatically updated is that for it to be useful that system needs to have a greater reliability than the thing it's monitoring. The signal to noise ratio is otherwise a bit ridiculous.
Reddit Status [1] isn't perfect, but it's miles better than a static page saying everything is operational while everything is in-fact inaccessible. That it took 30 minutes for the page saying everything is fine to be updated with a warning that there is an issue (while almost all of the services and regions remained marked as operational) only makes the ineffectiveness of that page more blatant.
It goes without saying that the monitoring system must be separate from what it's monitoring and must be more reliable. Compared to running a CDN for half of the internet, automated monitoring is table-stakes.
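Even a crude external probe running on infrastructure that shares nothing with the monitored stack gets you most of the way there; this is roughly what third-party uptime checkers do. A sketch with placeholder URLs:

    # Dead-simple external availability probe, meant to run somewhere that shares
    # no infrastructure with the thing it monitors. URLs are placeholders.
    import time
    import requests

    TARGETS = ["https://example.com/", "https://dash.example.com/healthz"]

    while True:
        for url in TARGETS:
            try:
                ok = requests.get(url, timeout=5).status_code < 500
            except requests.RequestException:
                ok = False
            if not ok:
                print(f"DOWN: {url}")  # in practice: page someone / flip the status page
        time.sleep(60)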
I wasted a bunch of time debugging the HTTP 500 errors on my site before I realized everything is 100% OK on my end, and that it's Cloudflare returning the error not my servers.
Ditto - I'm sitting here, wtf I'm not running Nginx on my blog, but I'm getting an Nginx response, hit IP directly....oooh.... right that doesn't make sense it's working fine. Cloudflare can't be down, that's next to, wait, status page (to their credit it's got a status note). HN here we go...
Would it be possible to adjust that 500 page to include an indication that it originates from Cloudflare, for the case that an outage like this happens again in the future?
This one seems be due to a hug of death rather than isitdownrightnow.com being behind Cloudflare, probably from too many people checking on all the other sites that are down.
I say this because: (1) it eventually loaded for me after I tried a few times and gave it time to load, and (2) its certificate doesn't report to be from Cloudflare but other sites I've checked that are down do
Yes, I started checking my router and wondering if anyone had managed to install some sort of exploit on it, as I was getting that 500 nginx page from half a dozen websites.
I had tinkered with my network settings just before this to troubleshoot an entirely unrelated problem so for a minute there I thought I broke everything lol
Their core service (DNS and web proxying) should see an outage once every 10 years or less. Much like Google Search (which is a far more complex service).
Yet it seems we get an outage more frequently than once a year. In my opinion, that makes the service too unreliable to base my business off - it's not like I can failover to another provider while they're down.
I'd love to use another company, but there's no one offering the same for the same price tag. Most of their services are free and they charge very little for the rest. Especially if you have a traffic heavy page with little revenue, Cloudflare is pretty much the only solution for CDN, WAF etc. All the others charge for traffic and cost a fortune.
All of my home's Ring cameras have been inaccessible this entire time. It's not that big of a deal for me because I planned for that eventuality, but a lot of people have not.
If you run a critical service (like Ring) and your infra is tied to CloudFlare - you're stuck! There is nothing at all you can do. That's freaking scary man. If I was working infra at Ring I'd much prefer to get paged and start fixing the problem. There are very few problems that can't be fixed in 15 minutes if you plan well for failover...
Explains why the online training course I was part way through stopped working! Amusing that the quickest diagnosis came from skimming the headlines here :)
Several sites I was trying to access all went down at the same time. Came to Hacker News to see what was up - not disappointed!
*Including America's Cardroom, perhaps the biggest "offshore" US poker site. I can promise you that there are a lot of people who were playing in tournaments that are very unhappy right now. New York here.
It's times like these when I'm appreciative of the simplicity of the HN tech stack. Was talking to some people on discord when it went down and then noticed some other websites were down. Came right to HN to see 5 different threads about this. Will be curious to see what the cause of the issue turns out to be
LMAO of course when every single thing I tried to use won't load or gives me a useless default 500 nginx error page, I find out why here. Figured it was CloudFlare. Single point of failure, not once.
The worst part is cloud infrastructure companies like DigitalOcean and Linode are both down simply because for some reason they can't build their own infrastructure to not rely on Cloudflare lol.
I think they rather got overwhelmed by many more requests reaching them that would usually hit Cloudflare. Also, as is widely known, people tend to hammer F5 when something like this happens, additionally increasing the number of requests.
If only we could come up with a globally distributed set of networks and systems that could be run by millions of entities that don't rely on each other to keep working. Oh no wait...
This was very educational, all of a sudden I couldn't reach 60% of all websites I normally visit everyday. I guess this is the cost of laziness under the guise of DDOS protection.
"Investigating - Cloudflare is investigating wide-spread issues with our services and/or network.
Users may experience errors or timeouts reaching Cloudflare’s network or services.
We will update this status page to clarify the scope of impact as we continue the investigation. The next update should be expected within 15 minutes."
Can someone link me to some information that explains what Cloudflare is besides being a CDN?
Like I understand how websites can be served using a CDN and how a lot of the internet depends on that... but I don't see how gaming services like Valorant or cloud providers like AWS or chat room like Discord depend on Cloudflare.
Their WAF is very useful, it makes it very easy to defend against attacks without paying anything. In general, their big plus point is that they offer many services for free, making it easier to onboard.
But by now they offer lots of services, although I believe WAF and CDN are probably still the most important to many.
Sites returning 500 is one thing; people will understand that's an error. A site that can't be found because DNS is out is not something the general public will start to debug - instead they'll walk away from the site, sometimes for good.
Question: how could (temporary) DNS errors be made nicer?
I was setting up some DNS for a site when it suddenly stopped working. After 30 minutes of messing with the settings and googling I gave up, came on here, and saw this.
My sites that are just using DNS are working fine; it's only those with the orange cloud (proxy turned on) that are broken.
Shouldn't have happened in the first place.
Should have had something that worked on their own website to indicate the service is down, not needing to come to a somewhat obscure tech forum to find out the details.
Because of state and federal regulations, the path through back doors, fire exits and water coolers is always shorter from the engineers' desks to the planetary atmosphere than it is through the front door and reception areas.
To be fair my information was not accurate. It was fast but when I said it was a problem with our "backbone" I was wrong (it was a networking problem but not the backbone). I favour speed over accuracy here, but the status page wants to be fast and accurate.
My main interest was that you were aware and that a fix was on the way. That's the difference between having to desperately act myself or just sit tight and placate clients. So, I appreciated your original comment!
The comment on HN had more useful information (that the issue was understood and a fix coming) before that status page then updated. I think that's their point.
Prior to that, it was some time (in the "all my sites are wrecked" timescale) before the status page had any indication of an outage.
The way I read their complaint was that they should have something on their website to indicate they were down. Anyway, at the time they complained, the status page also already said that the issue was identified and a fix was being rolled out.
Their post was saying that the dedicated status domain should be the first place to get useful information. There were multiple new threads on HN before the status page was updated at all. I'm sure there are legal reasons, but it's not ideal.
Then there was the CTO's (appreciated!) comment prior to the status page's second update with information suggesting this would be resolved soon (which IMO is the information everyone needs to report back to clients, bosses, etc).
That the status page was subsequently updated prior to OP's complaint isn't really relevant. It's still a point of discussion, whether someone comments immediately or later, right?
Maybe you should first try actually going to their status page[1]. It has been showing a global service disruption since 06:43 UTC; about 20 minutes since you wrote this.
The entire day yesterday, performance with Cloudflare was extremely sluggish. Pages which relied on it, even if only to load a JS file from the CDN, would hang for tens of seconds.
I cannot access science.org, quora.com, substack.com at the moment. It shows 500 Internal Server Error. Didn't know why but now it is clear. I guess I just wait until it is fixed.
Statuspage seems to be useless; I was just trying to get the status via multiple networks and my mobile network. Ironically, other downdetector services are also down.
Haha... I got pinged on my phone a site I manage is down, trying to figure out what's wrong with it, noticing other sites down, realizing it's Cloudflare
Why do people ask questions like this? You know the answer. This company offered products or services superior to alternatives so people elected to use them.
Wow, for me it looked like the world had gone mad. This is a reminder to not only rely on 1.1.1.1 for DNS resolution in PiHole.
I host most of my services locally, but ironically could not connect to my own homelab. I use a dedicated domain with DynDNS and did not configure the network and DNS to work without reliance on external DNS. Surely it's infinitely more likely for me to make a mistake, right?
yeah, and if cloudflare could make their anti-bot "verification" interoperable with noscript/basic (x)html browsers, and not force those grotesquely and absurdly massive google (blink) and apple (webkit) web engines, that would be less criminal.
Wishing Cloudflare ops teams the best to recover fast from this outage. Meanwhile, we urge customers to check out www.cdnreserve.com , and implement a sound CDN backup strategy (auto-failover) when the primary CDN suffers an outage.
I absolutely agree, and very respectfully so. No one is immune to outages. Well said. CDNReserve is designed so that if an outage occurs on one platform it will map the traffic to the failover CDN, and if the failover/backup CDN suffers an outage, the traffic will be shifted back to the primary CDN. It's built on the premise that the likelihood of two CDNs having an outage at the same time is close to zero.
The likelihood of CDNReserve having an outage on the other hand is 100%.
You aren't the first to come up with the idea of a CDN traffic director (I built one), and you'll soon discover customers recognize you are just another single point of failure and not the solution. Best to focus on the things other companies in the space market on, bill optimization, latency optimization, etc.
Agreed. The likelihood of any platform having an outage is 100%. But the likelihood of two networks having the outage at the same time is close to zero. It's awesome that you built a similar solution in the past. It would be great to jump on a call and learn from your experience if you are open to it.
A lot of my Australian colleagues were saying a lot of things were down, including all of our websites however me being in NZ, I was able to visit them.
So I do think Cloudflare actually is a bit more decentralised than we give them credit for really.
Just the fact that they messaged here in the HN thread about what was happening, what they knew, and how they were gonna fix it. That's just _awesome_
Kudos to them. I can't wait to see their after report.
Someone should make a website that collates the approx. 90M times this sentiment has been expressed on this website (a good chunk of them by me), just as a reminder of how somehow nothing changes on this front the moment it all comes back up and people go back to relying on single SaaSes for everything.
To be fair this is not relying on a single SaaS for everything but many people relying on a single SaaS. I mean if you want to use a reverse proxy/CDN, you must rely on someone.
Our key customer-facing services have a 99.995% uptime SLA (and a total of 2 or fewer incidents per year of "any length"), which means once you start concatenating services with 99.995% SLAs you aren't there.
How that SLA measures a 2 second outage for some customers is a separate thing, and sort of shows how meaningless these things can be on the internet (if you lose service for 10% of your potential customers is that an outage? How about 90%? How do you know how many were lost).
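For anyone following along, the "concatenating SLAs" point is just multiplication: serially dependent services multiply their availabilities, so the chain is always worse than its weakest link. A quick sketch with made-up figures:

    # Composite availability of serially dependent services (example figures only).
    def chained_availability(*availabilities: float) -> float:
        result = 1.0
        for a in availabilities:
            result *= a
        return result

    # e.g. CDN + DNS + our own 99.995% service, each "four-to-five nines" on paper:
    combined = chained_availability(0.99995, 0.9999, 0.99995)
    print(f"combined: {combined:.3%}")                        # ~99.980%
    print(f"downtime/yr: {(1 - combined) * 365 * 24 * 60:.0f} min")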
Measuring outages doesn't seem so meaningless as long as your money seems inaccessible.
Their main site went down for about 20 hours a couple weeks ago because their hosting provider went down. They deployed an HTTPS only static site in its stead, so at first blush it looked like they deployed nothing. Great when you're trying to find contact information hosted on that site.
Their online banking site leveraged Cloudflare, so obviously they just rode that outage out with no notifications, etc.
What if for some reason a single /24 was unreachable from the site (say an errant route for 12.85.25.0/24 somehow got in the path)? How would you even know that was a problem - how many customers are on that /24, and how would I measure their failed attempts to connect?
I have a remote office in India on Tata. The other day it had access to much of the internet, but due to a fibre break in the Mediterranean it didn't have access to endpoints in Europe for a good 20 seconds.
However the other link on a different ISP remained working at that time.
Does that count as an outage? If I wasn't actively monitoring that link at high resolution, would I even know about it?
I'd argue you're starting from a few orders of magnitude more competency than the credit union was. Their non-banking site was hosted by some podunk company in Texas with no sense of redundancy anywhere. Their provider had a near total networking outage and the credit union had no plan to recover from that.
Insofar as proactively monitoring a single /24, you (probably) don't. I don't think it's (usually) a company's job to monitor their customer's ISPs. The failures that "my" credit union had were due to their own choice in infra (Armor, Cloudflare). When Sonic nuked my config on their DSLAM after some maintenance I raised an issue with Sonic not with whatever other companies became inaccessible as a result.
> Does that count as an outage?
My POV may very well differ from whatever contracts and SLAs you have in place, but yeah maybe. If you can't fail over to the alternative ISP then yes that's an outage. Of course a trans-atlantic fiber break would also likely be a lot more noticeable than fat fingering a route for a /24. And sure, I've been stuck at megacorp when the VPN started handing out addresses in a new subnet but our department's networking team hadn't caught up. That's why you listen to your customers instead of throwing out a "someone else screwed up there's nothing we can do" response.
Me personally I don't think that a 20 minute banking outage is a massive problem (I've long since moved my money elsewhere), even the 20 hour outage was relatively minor. It just speaks to the unwillingness of the credit union to be highly available. They knew of the Armor outage and didn't actually test the remediation. I assume they didn't know about the Cloudflare outage. Both worry me. What happens when they're faced with a total failure of their online banking system?
But it isn't an outage. My monitoring point in Singapore could reach both ends; they just couldn't talk to each other, due to a routing issue on a third-party network over the internet.
On my own network, which I control, I accept that if a circuit breaks I'll have a 1, maybe 2 second outage while traffic reroutes. For some of my services that would be a problem, for others it's not. If Facebook loads 2 seconds later, nobody cares. If the winning penalty in the World Cup final blacks out, that's a big problem.
I'm new to this whole thing. Can you point me to how I can avoid depending 100% on CF if its DNS is down? Is there such a service? (kind of like load balancing, but with DNS?)
DNS is easy. You can (and must) have multiple nameservers for your domain. Just use different companies (and different regions) and if one goes down the others will still resolve.
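If you want to sanity-check that your delegated nameservers really are spread across providers, something like this works (assumes the dnspython package; the domain is a placeholder):

    # List a domain's delegated nameservers -- if they all belong to one provider
    # (or sit on one network), that provider is still a single point of failure.
    import dns.resolver

    for rr in dns.resolver.resolve("example.com", "NS"):
        print(rr.target)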
I remember when some key part of AWS EC2--EBS in us-east-1 maybe?--was down for a few days straight. Honestly, the main thing it taught me was "if you are honest with your customers they will mostly just come back later and buy everything they didn't buy today".
Yes and no. Obviously not great when everything goes down, but I find a strange sense of solace and calm when I know there’s a lot of people in the same boat and there’s little I can do but wait.
Yeah same lol. I saw that my SaaS services were not working and I got stressed thinking why all my EC2 instances went down at the same time. I checked down detector for EC2 and it reported that Cloudflare is down. I breathed a sigh of relief thinking that the (almost) whole of internet is down - nothing that I can do here.
It's quite ironic that the Internet was designed to withstand a nuclear attack, yet with how much everyone has started using the "cloud", a stupid configuration mistake at an important company can bring it to its knees.
We should really rethink this constant reliance on single points of failure.
I wonder if it really was, though. I’d think that these centralised services go down less than the self-hosted stuff did previously. Is it better to have more overall uptime where downtime means everything stops, or random downtimes of individual sites that add up to more downtime?
I mean if large websites like Notion or Medium had used IPFS instead, there would be no central point of failure, and web pages would still be available from distributed hosts
Cloudflare just offers great services. It's a straight-up fact that even their free tier is extremely generous. There is no big conspiracy to 'take over the internet'; when the product is good, the product is good.
CloudFlare should be run by the CIA or something - astonishing MITM opportunities. The only clear sign the CIA is not deeply involved is that CloudFlare is far too competent.
It blows my mind how most of the otherwise savvy readers of HN completely gloss over the fact that Cloudflare unwraps TLS on most of their internet traffic.
I trust that the current leadership might not do something evil, but they are publicly traded. At some point a group of investors are going to figure out that merging Cloudflare with an advertising network would create a level of user targeting that Google and Facebook could never dream of.
Governments in Europe and elsewhere are already working on legislation to restrict e2e encryption by law. Regulating things like Cloudflare to hand over data, as they have already done with ISPs, is not even much of an imagination leap. For example, in the UK all time:srcip:destip:user data must be kept for 1 year by every residential ISP and provided to government departments (not even law enforcement) through a standard system.
Not a 'big conspiracy'. It's the business model, isn't it? Or isn't CF going for the biggest marketshare and maximizing profits on that, like all the others?
It's certainly any business' model to grow as big as possible, but it's a hard business model to implement so competition is hard to find. I just can't blame CF for that imo.
Maybe they are more decentralized than what we are giving them credit. I'm having different error messages (nginx, dns, 404) on different websites. Not sure if it's a full breakdown of their systems or a coordinated attack.
Should be back up everywhere.