
> The fact it's been so long and they still haven't revealed and explained the root cause of the outage

They did last night: https://www.atlassian.com/engineering/april-2022-outage-upda...



> Faulty script. Second, the script we used provided both the "mark for deletion" capability used in normal day-to-day operations (where recoverability is desirable), and the "permanently delete" capability that is required to permanently remove data when required for compliance reasons. The script was executed with the wrong execution mode and the wrong list of IDs. The result was that sites for approximately 400 customers were improperly deleted.

Ouch. I hope no one person got the blame. This is a systemic failure. Regardless, my sympathies to the engineers involved.


I don't want to assume too much, since the details are sparse. But I know for a fact that few of my current coworkers know a thing about writing tooling code. It's becoming a bit of a lost art.

Here's how such a script should be done. Give it a dry-run flag, or better yet, make the script dry-run only: it checks the database, gathers actions, and sends those actions to stdout. You dump that to a file. The commands are executable; they can be SQL or additional shell scripts (e.g. "delete-recoverable <customer-id>" vs. "delete-permanent <customer-id>").

The idea is you now have something to verify. You can scan it for errors. You can even put it up on GitHub for review by stakeholders. You double- and triple-check the output and then you execute it.
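A minimal sketch of that pattern, where the tool only ever prints commands for a human to review and run (all names here, like fetch_candidates and delete-recoverable, are made up for illustration):

```python
#!/usr/bin/env python3
"""Dry-run-only cleanup tool: it never mutates anything itself.
It only emits commands a human can review, diff, and then run."""

def fetch_candidates(db):
    # Stand-in for the real database query; returns (customer_id, reason).
    return [("cust-1042", "trial expired"), ("cust-2077", "requested deletion")]

def emit_plan(db):
    lines = []
    for customer_id, reason in fetch_candidates(db):
        # One reviewable, executable command per action; the reason rides
        # along as a shell comment so reviewers can sanity-check each line.
        lines.append(f"delete-recoverable {customer_id}  # {reason}")
    return lines

if __name__ == "__main__":
    for line in emit_plan(db=None):
        print(line)
```

You'd run it as `./plan-deletions.py > plan.sh`, put plan.sh up for review, and only execute it once it has been approved.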

Tooling that enhances visibility by breaking down changes into verifiable commands is incredibly powerful. Making these tools idempotent is also an art form, and important.
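Idempotency in such a tool can be as simple as checking state before acting, so re-running the same plan is harmless. A toy sketch, with a dict standing in for the database (names are hypothetical):

```python
def delete_recoverable(db, customer_id):
    """Mark a customer for (recoverable) deletion; safe to re-run."""
    row = db.get(customer_id)
    if row is None or row.get("deleted"):
        return False  # already gone or already marked: re-running is a no-op
    row["deleted"] = True
    return True
```

Because the second run changes nothing, an interrupted plan can simply be executed again from the top.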


That’s how I did one of my more impactful deduplication/deletion scripts. It had to reach across environments to do its work, but there was no way to pass it flags; the environment names were hard-coded, so e.g. dev-uw2 reaching out to stg-ue1. It would output a dry-run result by default, and you could look and see what was going to get deleted and from which environment.

Because the names were hard coded, I had to get changes approved in GitHub. Then the script would run on Jenkins.

That script was also only for that purpose and nothing else. It made a mess, because I needed a ton of functionality around creation and querying too. I just copied the script into folders and modified each copy as needed; a better solution would’ve been to make a Python module. But I liked the code itself being highly specific to what the script was doing, to help reduce mistakes: if I’m running a script to delete repos, I have to go to the delete-repos directory.


If coding is theatrical then ops is operatic. You have to telegraph stuff so over the top that the people in the cheap seats know what’s going on.

I think what we’ve lost in the post-XP world is that just because you build something incrementally doesn’t mean it’s designed incrementally (read: myopically).

My idiot coworkers are “fixing” redundancy issues by adding caching, which recreates the very problem they’re (un?)knowingly trying to avoid: having to iterate over things twice to accomplish anything. They’ve just moved the conditional branches into the cache and added more.

Most of the time, and especially on a concurrent system, you are better off building a plan of action first and then executing it second. You can dedupe while assembling the plan (dynamic programming) and you don’t have to worry about weird eviction issues dropping you into a logic problem like an infinite loop.

More importantly, you can build the plan and then explain the plan. You can explain the plan without running it. You can abort the plan in the middle when you realize you’ve clicked the wrong button. And you can clean up on abort because the plan is not twelve levels deep in a recursive call, where trying to clean up will have bugs you don’t see in a Dev sandbox.

    Deleting 500 users…
Versus

    Permanently deleting 500 users…
Maybe with a nice 10 second pause (what’s an extra ten seconds for a task that takes five minutes?)
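The plan-then-execute shape, with the dangerous mode telegraphed loudly, might look something like this (a sketch; the function names and the default 10-second pause are illustrative):

```python
import time

def build_plan(requested_ids):
    # Dedupe while assembling the plan, preserving order.
    seen, plan = set(), []
    for uid in requested_ids:
        if uid not in seen:
            seen.add(uid)
            plan.append(uid)
    return plan

def banner(plan, permanent, pause_seconds=10):
    # Say the scary part out loud and give the operator time to abort.
    verb = "PERMANENTLY deleting" if permanent else "Deleting"
    msg = f"{verb} {len(plan)} users... (Ctrl-C within {pause_seconds}s to abort)"
    print(msg)
    time.sleep(pause_seconds)
    return msg

def execute(plan, run):
    done = []
    for uid in plan:
        run(uid)
        done.append(uid)  # if aborted mid-plan, `done` says exactly what to undo
    return done
```

The point is that the plan exists as data before anything runs: you can print it, review it, and walk away during the pause, and an abort leaves you with a precise record of what was and wasn't done.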


I suppose that’s why you don’t combine a taser and a gun into one device with two triggers.


If you had a third trigger where the gun turns on the user, it would be fairly safe.


The problem is that sometimes that gun looks like a taser.


Instead, you make it with one trigger and a PRNG that decides which gets activated. Just hope you've chosen the right PRNG!!


I will then write a script that calls your script with the PRNG of my choice: PRNG1 always returns "trigger 2", and PRNG2 always returns "trigger 1". This detail will be documented in Confluence.


Considering American police can't even seem to get it right when they have two distinct firearms, and are trained to holster them on specific sides so they know what they are grabbing - and still manage to f*ck it up... this might be an improvement.


This speaks to a lack of operational excellence - when you develop a platform like JIRA, Confluence, etc, the operational tools required to manage the systems are just as important as the features themselves. If all you do is pump out features, you're a feature factory and will suffer these kinds of issues. There's no reasonable explanation for needing a script to do what was described when the necessary tooling to generalize such an operation should have been in existence.


Right? The way this reads it seems like one person set a flag incorrectly, something I'm sure we've all done numerous times. And there were no checks down the line to catch it.


Hi, this is Mike from Atlassian Engineering. You are right that the checks need to improve to reduce human error, but that's only half of it. I don't see this as a human error, though; it's a system error. We will be doing some work to make these kinds of hard deletes impossible in our system.


The bigger story here is that they're restoring data that should have been permanently deleted for compliance reasons.


> Communication gap. First, there was a communication gap between the team that requested the deactivation and the team that ran the deactivation. Instead of providing the IDs of the intended app being marked for deactivation, the team provided the IDs of the entire cloud site where the apps were to be deactivated.

So what they are saying is that they don't test scripts on a staging server before running them in production. It's wild that they've managed to scale their products this much before something like this happened.

I hope they've learnt their lesson and they set up some QA process for that stuff.


It seems the script worked as intended, so they do have a QA process. The problem was the wrong IDs being provided, and I doubt that at their scale they have a staging environment that duplicates the customer data.


> I doubt that at their scale they have a staging environment that duplicates the customer data.

If there is no feasible way of replicating their production environment somewhere else, then there should be some sanity checks in place. Something like "if an abnormally high number of customer sites go down during the script's execution, kill the script". This is a 20/20 hindsight approach though, and if Atlassian engineers couldn't come up with it, I doubt a random HN user like me can.
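One way to encode that sanity check is a tripwire that the bulk job feeds as it goes; once the impact crosses a threshold, the job dies. The class name and threshold here are invented:

```python
class Tripwire:
    """Abort a bulk operation once it has affected too many sites."""

    def __init__(self, max_affected):
        self.max_affected = max_affected
        self.affected = 0

    def record(self, n=1):
        # Called once per site touched; blows up past the limit.
        self.affected += n
        if self.affected > self.max_affected:
            raise RuntimeError(
                f"tripwire: {self.affected} sites affected "
                f"(limit {self.max_affected}), aborting")
```

A deletion loop would call `tripwire.record()` per site, so a bad ID list can only do bounded damage before the whole run is killed.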


Would it be bad practice to prefix a GUID-type ID with values that help a human recognize it? For instance, in this specific case they wanted app IDs, so something like APP-XXXXX-XXXX-blahblah vs. CLOUD-XXXXX-blahblah.

I'm not looking to solve their specific problem; this is more a general question I've thought about but never acted on, because I'm sure I'd get laughed at for blazing my own trail.
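For what it's worth, this is an established practice, not trail-blazing: Stripe's API, for example, hands out prefixed IDs like cus_... and ch_.... A minimal sketch of minting and checking such IDs (the prefix set here is hypothetical):

```python
import uuid

KINDS = {"app": "APP", "cloud": "CLOUD"}  # hypothetical namespaces

def new_id(kind):
    # A random UUID, tagged with a human-readable kind prefix.
    return f"{KINDS[kind]}-{uuid.uuid4()}"

def require(kind, value):
    """Refuse an ID of the wrong kind before it reaches a delete script."""
    prefix = f"{KINDS[kind]}-"
    if not value.startswith(prefix):
        raise ValueError(f"expected a {kind} ID, got {value!r}")
    return value
```

A deletion tool that calls `require("app", some_id)` up front would have rejected a list of cloud-site IDs outright.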


While we don't do exactly that, when pulling out lists of IDs like that for someone else, internal or external, we strive to include a description column as well.

This might be customer ID and customer name, article number and article description, invoice ID and invoice number, etc.

Then it is usually very clear to the recipient what they've been handed.

Also, for internal auto-increment-type IDs, we mostly use sequence generators with non-overlapping "series". That is, we'll start the first one at 1 million, the second at 2 million, and so on. Not perfect, but it can be useful.
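With non-overlapping series, even a bare number carries a hint of which table it came from. A sketch (the starting offsets echo the comment above; the classifier itself is hypothetical):

```python
SERIES = {"customer": 1_000_000, "invoice": 2_000_000}  # start of each series
WIDTH = 1_000_000  # each series owns a block this wide

def classify(numeric_id):
    # A bare number can be traced back to its table by range alone.
    for name, start in SERIES.items():
        if start <= numeric_id < start + WIDTH:
            return name
    return None  # outside any known series: refuse to guess
```

A script handed the wrong column of IDs would then classify them as the wrong kind instead of silently deleting them.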


This is recommended in my experience, but you do have some potential issues when a UUID gets reused or repurposed.

Whenever a human is involved in the chain, UUIDs are suspect because there's no easy way to verify what one refers to, whereas a human has a good chance of realizing that $1,342.34 is probably not a valid date.


I kind of dig it. Something that helps make things obvious to a human


--dry-run and verify the output


Highlighting the text in any of their lists breaks the page in interesting ways, apparently due to some twitter-sharing functionality.


Is it just me, or is highlighting on that site broken?

Perhaps my ad blocker is causing that stupid highlight-to-tweet JS they're using to break.



