I once worked at a company that had a data loss issue. We had exhausted every option we had over almost 40 hours, and there was nothing else we could do. At the end of the second day, it was decided to restore from backup.
We had done this before, as a test. It took about 12 hours to restore the data and another 12 hours to import the data and get back up and running.
One small thing was different this time, and it had huge consequences. As a cost-saving measure, an engineer had changed the location of our backups to the cold-storage tier offered by our cloud provider. All backups, not just 'old' ones.
This added two more days to our recovery time, for a total of five days. Interestingly enough, even though we offered a full month's refund to all of our customers, not even half of them took us up on it.
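For what it's worth, the safer pattern is to let a lifecycle rule age only older backup copies into cold storage, so the newest backup always sits on a tier you can restore from quickly. Here's a minimal sketch of what I mean, assuming an S3-style API via boto3 (the provider in my story isn't named, and the bucket name and 30-day threshold are made up):

    # Illustrative only: assumes an S3-style API; names and thresholds are made up.
    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_lifecycle_configuration(
        Bucket="example-backups",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "age-out-old-backups-only",
                    "Status": "Enabled",
                    "Filter": {"Prefix": "backups/"},
                    # Recent backups stay on the standard tier for fast restores;
                    # only copies older than 30 days move to cold storage.
                    "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                }
            ]
        },
    )

Pointing the backup job itself at the cold tier, which is effectively what happened to us, means every restore pays the retrieval delay.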
Hi, I'm Mike and I work in Engineering at Atlassian. Here's our approach to backup and data management: https://www.atlassian.com/trust/security/data-management - we certainly have the backups, and we have a restore process that we follow. However, this incident stressed our ability to do this at scale, which has led to the very long restore times.
Hey Mike, not dumping on you personally, but the RTO is claimed to be 6 hours. I can understand that being a target, but we're at 32x that target, with a communicated completion date another 12 or so days out, IIRC. That's coming up on two orders of magnitude longer than the RTO. I don't think any rational person would take that document seriously at this point.
I'll also ask (since nobody else has answered, I may as well ask you):
1. Are the customers actually being restored from backups (and additionally, by a standard process)?
2. Will the recovery also include our integrations, API keys, configuration and customization?
Hi Ranteki, you're right that the RTO for this incident is far longer than any of the ones listed in the doc I linked above. That's because our RPO/RTO targets are set at the service level, not at the level of a "customer". This is part of the problem, and it demonstrates a gap both in what the doc is meant to express and in our automation. Both will be reviewed in the PIR.
Also, the answer to (1) and (2) is yes.
A friend in Atlassian engineering said the numbers on the trust site are closer to wishful thinking than to actual capabilities, and that there has been an engineering-wide disaster recovery project running because things were in such bad shape. The recovery part of that project hasn't even started. If Atlassian could actually restore full products in under six hours, they should have been able to restore a second copy of the products exclusively for the impacted customers.
Nah. The RTO/RPO assumes that only one customer has a failure big enough to require a restore.
When the entire service is hosed, that's a totally different set of circumstances, and you have to look at what the RTO/RPO are for basically restoring the entire service for all customers. And since they have more than a thousand customers, it totally makes sense that it would take orders of magnitude longer to restore the entire service.
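To put rough numbers on that (these are made-up figures for illustration, not anything Atlassian has published): even if each individual tenant restore hits the 6-hour target, restoring hundreds of tenants with limited parallelism stretches the total out by well over an order of magnitude.

    # Back-of-envelope sketch; per-tenant time and parallelism are assumptions.
    hours_per_tenant = 6        # the published per-restore RTO target
    affected_tenants = 400      # tenants reportedly impacted by the incident
    parallel_restores = 5       # assumed number of restores run safely at once

    total_hours = affected_tenants * hours_per_tenant / parallel_restores
    print(f"~{total_hours:.0f} hours (~{total_hours / 24:.0f} days)")
    # ~480 hours (~20 days), roughly 80x the 6-hour target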
I think this document and incident are a decent example of common DR planning failure patterns.
The doc explains that Atlassian runs regular DR planning meetings, with engineers spending time planning out potential scenarios, as well as quarterly backup tests and tracking of the findings from them.
So, with those two things happening, I imagine the recovery time objective of <6 hours was based on a typical "we deleted data from a bad script run affecting a lot of customers" scenario, using the metrics from the quarterly backup tests.
That doesn't come close to the recovery time we are seeing now, however. We're coming up on two orders of magnitude more than that.
The above doc seems pretty far out of line with what is currently happening.
It's 400 tenants scattered across all their servers, so they are most likely having to build out servers to pull the data and then put it in place. That's 10x the problem that restoring a single server would be.
This is why I love GCP Cloud Storage. The "colder" tiers are cheaper, and reads from them simply cost a lot more, but they aren't slowed down so that a restore takes days. You pay with dollars, not time, for restoring those GCS backups. e.g. Coldline [1] simply has reduced availability in exchange for being cheaper (99.9-99.95% availability, so about 43 minutes of downtime a month, way less than "two days").
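A minimal sketch of that trade-off with the google-cloud-storage Python client (the bucket and object names here are made up): a Coldline bucket is created like any other, and restoring an object is the same call you'd make against a Standard bucket; the difference shows up on the bill rather than in the restore timeline.

    # Sketch only; bucket and object names are hypothetical.
    from google.cloud import storage

    client = storage.Client()

    # Default the bucket to Coldline: cheaper at rest, pricier per read,
    # but reads are served immediately like any other storage class.
    bucket = client.bucket("example-coldline-backups")
    bucket.storage_class = "COLDLINE"
    client.create_bucket(bucket, location="us-central1")

    # Restoring uses the same API as a Standard-class bucket; you pay a
    # retrieval fee instead of waiting days for the data to thaw.
    blob = bucket.blob("backups/db-2022-04-05.dump")
    blob.download_to_filename("/tmp/db-2022-04-05.dump")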
Not every business can afford to go one month without income. What's the best thing for customers? Have the business go bankrupt and irremediably lose access to the service?
Fastmail gave 1 month of free service to about 2/3 of our customers after a major disk failure that led to about a week of downtime for them as we recovered from backups in... 2005ish, I think. Long time ago - it was a pretty major hit, and the wave in income is still visible all these years later as a lean month where there are no renewals from that batch! Definitely the right thing to do though.
400 clients, but how much of their revenue? Were they all small clients? Not to mention the longer-tail effects of people moving away to competitors even if they weren't directly affected?
Acting purely in good faith would mean refunding all of that money to people who are already your customers.
The business-wise move is to stay in their good graces and keep those customers by offering the refund, while not losing any money to those who either don't care or won't move to a competitor.
25 years ago the clutch in my beater truck was slipping. I was 16 years old, making $50 a week and had very little in savings. I took that truck to a shop within walking distance of my job.
2 hours later I walked back to see what they found. I figured it would be several hundred dollars for a new clutch, and I'd have to borrow money or something to get it done. I talked to the owner, who told me it was an adjustment on the cable. It just needed to be scootched up a bit and was probably good for another 30k miles.
When I asked him how much I owed, he laughed at me and said, "For that? Not worth writing it up. No charge. You want me to show you how to do it yourself next time?"
The shop could very easily have charged me 1 hour of labor at their standard rate, maybe $75 or so. Plus a diagnostic or test drive fee. Whatever. He could have told me, "$123.98" and I would have paid it. I wouldn't even have been mad. But I sure as hell wouldn't have remembered the experience so clearly. Nor would I have told a dozen people over the years to take their cars there. And I definitely would not have driven 20 miles out of my way to return to that shop in later years.
Being cynical about this stuff will hurt your brand. It's not obvious, and it doesn't show up on the earnings report as a line item. That kind of charge-for-everything service segmentation seems like a no-brainer to a clueless MBA, but it actually matters in the long run: how people view your brand is immensely important.
Not forcing customers you already screwed over to then spend more time chasing a refund is not only the right thing to do, it's also good business.
Your anecdote is nice, and sure, it can be good advertising to give stuff away for free, but it doesn't really apply here.
If you were charged $123.98 and you said, "hey, I told you where the problem was, why am I being charged a diagnostics and driving fee?" and they corrected it by telling you the whole thing is on the house, is that not good business sense?
Even by your own admission, you would have gladly paid that $123.98 with no issues and you wouldn't have been mad about it. So from a business perspective, if they can provide a service, get paid for it, and the customer has no qualms or issues with the transaction whatsoever, in what way is that hurting the brand or being cynical? I think that's a much more business-wise action to take than to give away your services.
> If you were charged $123.98 and you said, "hey, I told you where the problem was, why am I being charged a diagnostics and driving fee?" and they corrected it by telling you the whole thing is on the house, is that not good business sense?
No. I'll be happy that I saved on the money, but I won't trust them in the future. They're now "the place that tries to get away with things" in my mental Rolodex. Better to stick with the fee and know their value. (I didn't tell them where the problem was. All I knew was that the clutch wasn't grabbing anymore. I assumed it needed a whole new clutch.)
> Even by your own admission, you would have gladly paid that $123.98 with no issues and you wouldn't have been mad about it. So from a business perspective, if they can provide a service, get paid for it, and the customer has no qualms or issues with the transaction whatsoever, in what way is that hurting the brand or being cynical?
It would have been a fine decision, sure. But in that case that would likely have been the only business I did with them. Not out of spite or anger, but because I'd have no reason to pick them for future business. I would instead ask friends for recommendations, or pick some place closer to my future residences.
But what actually happened was that I was the one steering people to them. I also went out of my way to return to them for brake jobs, simple oil changes, etc. I was a loyal customer, and probably spent or caused others to spend over $5,000 there.
He had absolutely no way of knowing that would result. But if you just treat people right, the way you'd want them to treat you, you build a reputation. It pays back.
I know this story comes off a bit Pollyanna. I get it. For a cynical and non-altruistic explanation: when it takes a technician literally 5 minutes to twist an adjustment nut and verify that was all there was to it, stop and think about the bigger opportunity before you robotically mark '1.00' in the "LBR HRS" field on an invoice. Especially if you're operating in a field that's notorious for rip-offs.
> I think that's a much more business-wise action to take than to give away your services.
I'm not saying businesses should give away major services. But they should avoid the temptation to nickel-and-dime as well. That's on the other end of the optimization curve. Not good business.
> He had absolutely no way of knowing that would result.
I think he absolutely knew that building trust is key to solid, long-term, repeat business - not only from the direct customer whose trust he has earned but also the zero-effort initial positive trust-balance he will have with his future/potential customers, even before he has done anything for them, just via word-of-mouth referrals. Such a simple concept but it just doesn't compute for some people.
> But if you just treat people right, the way you'd want them to treat you, you build a reputation. It pays back.
Reducing the impact analysis within a long-running relationship to a single transaction is too narrow. People observe how other people are treated and draw their conclusions even if they aren't impacted themselves. People may tolerate some abuse, but it moves them closer to leaving next time. Money lost in the outage may create the budget to go looking for an alternative.
A lot of people making those decisions don’t care about a refund because it’s other people’s money anyway. In my experience only small companies care about that.
Focusing on communicating openly and honestly allows them to explain to their bosses the crap they're going through because of your mistakes, so in fact you can help them save their asses, and they'll save your ass in return. This is much more important and valuable than a refund.
So you should ALWAYS communicate openly and honestly, and offer the refund as an option for clients who do not have a boss to account to.
I've seen cases where it was actually _more_ work for a business to process a refund. That money has to go all the way back through accounting/financing, be re-added to budgets for the appropriate groups, etc. It's not something done all the time so it takes extra time for those working on it. It's not like a Visa credit card getting a refund for a wrong coffee order.
Did I ever tell you guys about the time we accidentally nuked all the mailboxes for all the million-plus users on The Global Network Navigator (GNN) site? And how the restore process failed for us?
This hasn't been written up at The Register yet, so I don't have a single URL I can share with you.