I have worked with several small clients to migrate away from AWS/Azure instances onto dedicated hardware from Hetzner or IBM's "Bare Metal" offerings.
The question I ask first is: as a company, what is an acceptable downtime per year?
I give some napkin-calculated figures for 95%, 99%, 99.9% and 99.99% uptime to show how both cost and complexity can skyrocket when chasing 9s.
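For reference, the napkin math is just one line per target; a quick sketch (plain arithmetic, nothing provider-specific):

```python
# Napkin math: allowed downtime per year for a given availability target.
targets = [0.95, 0.99, 0.999, 0.9999]
minutes_per_year = 365 * 24 * 60

for a in targets:
    allowed = (1 - a) * minutes_per_year
    print(f"{a:.2%} uptime -> {allowed:,.0f} minutes "
          f"({allowed / 60:.1f} hours) of downtime per year")
```

That works out to roughly 438 hours, 88 hours, 8.8 hours and 53 minutes per year respectively.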
They soon realise that a pair of live/standby servers might be more than suitable for their business needs at that particular time (and for the foreseeable future).
There is an untapped market of clients moving _away_ from the cloud.
SLAs are overrated. An SLA mostly relates to "unplanned downtime", so if you often need to fix things, just schedule downtime, mess around with it, and bring it back up.
Also, we have seen both cloud and non-cloud hosts have significantly more downtime than their SLA allows, but they put it down to affecting a "small subset of our customers" so they don't have to do anything.
It's a bit like my Student Loan guarantee, "Do the paperwork and we guarantee you will have the loan on time". The loan was not paid, "I thought you guaranteed it?" "We do but we made a mistake" "So what do I get because of the guarantee?" "Nothing". Cheers!
I have never seen a publicly-advertised SLA practically worth anything, by my reckoning—whether offered to all customers or an extra that you’d pay for. (Privately-arranged SLAs I can’t comment on. They could potentially have actually meaningful penalties.)
Take Vultr's as an example; I'm familiar with it as a customer, and I believe it's pretty typical:
• 100% uptime, except for up to ten minutes’ downtime (per event) with 24 hours’ notice or if they decide there was a time-critical patch or update. (I have one VPS with Vultr, and got notice of possible outages—identified purely by searching “vultr service alert” in my email—8× in 2022, 3× in 2021, 14× in 2020, 9× in 2019. No idea how many of them led to actual outage.)
• They’ll give you credits according to a particular schedule, 24–144× as much as the outage time (capped at a month’s worth after a 7h outage, which is actually considerably better than most SLAs I’ve ever read). Never mind the fact that if you’re running business on this and actually depending on the SLA, you’re probably losing a lot more than what you’re going to get credited for.
• Onus of reporting outages and requesting credits is on you, by submitting a support ticket manually and explicitly requesting credit. So the vast majority of SLA breaches (>99.9%, I expect; I don’t care to speculate how many more nines could be added) will never actually be compensated. And determination of whether an eligible outage occurred is at their sole discretion, so that they could frankly get away with denying everything all the time if they wanted to.
Such SLAs basically lack fangs; run the numbers and the credits barely register (a rough sketch follows). I suppose you'd want something along the lines of insurance instead of an SLA, if it actually mattered to you.
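To make the "you're losing more than you get credited" point concrete, here is a tiny sketch with entirely made-up numbers; the bill, outage length and hourly loss are assumptions for illustration, not Vultr's actual schedule:

```python
# Rough sketch of why SLA credits lack fangs. All figures are hypothetical:
# a $20/month VPS, a 7-hour outage (where credit caps out at a full month),
# and a business losing $200/hour while it is down.
monthly_bill = 20.0    # what you pay the host (hypothetical)
outage_hours = 7.0     # outage length at which the credit caps out
hourly_loss = 200.0    # your revenue/productivity loss per hour (hypothetical)

credit = monthly_bill  # best case: a full month credited
loss = outage_hours * hourly_loss

print(f"SLA credit:    ${credit:,.2f}")
print(f"Business loss: ${loss:,.2f}")
print(f"Credit covers {credit / loss:.1%} of the loss")
```

With those numbers the maximum credit covers well under 2% of what the outage actually cost you.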
> They soon realise that a pair of live/standby servers might be more than suitable for their business needs at that particular time (and for the foreseeable future).
I've worked at one of those companies that have the live/standby model in place.
The problem is that switching load from live to standby when something goes wrong often requires manual intervention and a documented procedure.
The procedure must be tested from time to time, and adjusted as the environment changes.
Oh and the live and standby environments must be kept in sync...
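For illustration, here is a minimal sketch of what a partially scripted switchover step can look like; the health endpoint, hostnames and the promote command are hypothetical placeholders, not a real tool:

```python
# Minimal switchover sketch, assuming a hypothetical setup where "failover"
# means promoting the standby host. Everything named here is illustrative.
import subprocess
import urllib.request

LIVE_HEALTH_URL = "https://live.example.internal/healthz"   # hypothetical endpoint
PROMOTE_STANDBY = ["ssh", "standby.example.internal", "sudo", "promote-standby"]

def healthy(url: str, timeout: float = 5.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

if not healthy(LIVE_HEALTH_URL):
    # This is exactly the step that tends to need a human: confirming the live
    # node is really down and that the standby's data is current before promoting.
    subprocess.run(PROMOTE_STANDBY, check=True)
```

Even with a script like this, the judgment calls (is it really down? is the standby in sync?) are why the procedure needs regular testing.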
My go-to setup, if I need more uptime than a single server running in Google Cloud with live migration grants, is a three-node Galera cluster with an A record pointing at each node, where each node also runs the application in addition to the database. You can do rolling updates without any downtime (a quick pre-flight check for that is sketched below), and I've even had setups like this go years without downtime. It isn't perfect, but it works very well and obviates having to worry about things like standby switchover.
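As an illustration of the rolling-update part, a small pre-flight check could look like this in Python; hostnames and credentials are placeholders, and it assumes the PyMySQL client (`pip install pymysql`):

```python
# Pre-flight check before taking one Galera node out for a rolling update:
# every node should report state "Synced" and see the full cluster.
import pymysql

NODES = ["node1.example.com", "node2.example.com", "node3.example.com"]

def wsrep_status(host: str) -> dict:
    conn = pymysql.connect(host=host, user="monitor", password="secret")
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW STATUS LIKE 'wsrep_%'")
            return dict(cur.fetchall())
    finally:
        conn.close()

for host in NODES:
    status = wsrep_status(host)
    size = int(status.get("wsrep_cluster_size", 0))
    state = status.get("wsrep_local_state_comment", "unknown")
    ok = size == len(NODES) and state == "Synced"
    print(f"{host}: cluster_size={size}, state={state}, {'OK' if ok else 'NOT SAFE'}")
```

If all three nodes report Synced with a cluster size of 3, taking one node down for the update still leaves a two-node primary component, so the application keeps serving traffic.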
IME many companies claim 99.99+ uptime but then the penalties are trivial. If a 99.99 SLA is busted with an hour of downtime in a month but the penalty is 5% bill credit, the company just lost $500 on $10k revenue, assuming that:
A) Customers actually chase the credit, which (again IME) many companies make very difficult
B) The downtime is very clearly complete downtime. I've seen instances where a mobile app is completely down (but the web product works) or a key API is down (but the core product works) or there are delays in processing data (but eventually things are coming through). All of these can cause downstream downtime to customers but may not be covered by a "downtime" SLA.
A company once claimed nine 9s (99.9999999%, about 0.03 seconds of downtime per year) for their new cloud service to me. When pressed on how they came up with the number, they said they were measuring the percentage of time the login webpage loaded (not whether you could actually log in or anything worked inside the page and app), and that the https://uptime.is/ tool only went up to nine 9s.