I'm appalled at the way some people here receive an honest postmortem of a human...

t0mas88 · on Oct 18, 2020

The software sector needs a bit of aviation safety culture: 50 years ago the conclusion "pilot error" as the main cause was virtually banned from accident investigation. The new mindset is that any system or procedure where a single human error can cause an incident is a broken system. So the blame isn't on the human pressing the button, the problem is the button or procedure design being unsuitable. The result was a huge improvement in safety across the whole industry.

In software there is still a certain arrogance of quickly calling the user (or other software professional) stupid, thinking it can't happen to you. But in reality given enough time, everyone makes at least one stupid mistake, it's how humans work.

janoc · on Oct 18, 2020

It is not only that but also realizing that there is never a single cause to an accident or incident.

Even when it was a suicidal pilot flying the plane into a mountain on purpose. Someone had to supervise him (there are two crew members in the cockpit for a reason), someone gave him a medical, there is automation in the cockpit that could have at least caused an alarm, etc.

So even when the accident is ultimately caused by a pilot's actions, there is always a chain of events where if any of the segments were broken the accident wouldn't have happened.

While we can't prevent a bonkers pilot from crashing a plane, we could perhaps prevent a bonkers crew member from flying the plane in the first place.

Aka the Swiss cheese model. You don't want to let the holes to align.

This approach is widely used in accident investigations and not only in aviation. Most industrial accidents are investigated like this, trying to understand the entire chain of events in order that processes could be improved and the problem prevented in the future.

Oh and there is one more key part in aviation that isn't elsewhere. The goal of an accident or incident investigation IS NOT TO APPORTION BLAME. It is to learn from it. That's why pilots in airlines with a healthy safety culture are encouraged to report problems, unsafe practices, etc. and this is used to fix the process instead of firing people. Once you start to play the blame game, people won't report problems - and you are flying blind into a disaster sooner or later.

jonplackett · on Oct 18, 2020

It’s interesting that this is the exact opposite of how we think about crime and punishment. All criminals are like the pilot, just the person who did the action. But the reasons for them becoming criminals is a seldom taken into account. The emphasis is on blaming and punishing them rather than figuring out the cause and stopping it happening again.

cdaven · on Oct 20, 2020

The criminal has to take the punishment for his actions (and extenuating circumstances are taken into consideration), but at the same time, people, companies and society, have learned that we need protection and prevention.

So you could argue that there have been a lot of post-mortems through the ages, with great ideas thrown around on how to avoid crimes being committed (at least against me/us). It's not just about locking people up.

pc86 · on Oct 22, 2020

Most airplane crashes aren't intentional.

When's the last time someone accidentally committed armed robbery?

jonplackett · on Oct 22, 2020

Lots of people would probably rather not commit crime. Not everyone, but a lot of crime is people who see no other way. People society usually failed while they were children. It’s still all a big system. If you don’t change the cause you’ll continue to get the same results. Even someone actively committing crime like the armed robber you describe are still part of that. Motivation is just the affect of previous events. You can say this without condoning criminal acts.

fireant · on Oct 19, 2020

The difference is the intent. The criminal wants to do bad things while the pilot does not want anyone to get harmed.

cdaven · on Oct 20, 2020

Sure, intent is relevant, but the example was "a suicidal pilot flying the plane into a mountain on purpose". Isn't killing all those passengers a crime?

ckuehne · on Oct 20, 2020

Of course it is. That is why, had the pilot survived, he/she would have faced criminal charges.

jonplackett · on Oct 21, 2020

It’s not one or the other, especially if there’s intent. Yes punish, but don’t scapegoat the whole systemic problems on the individual and then think you solved the problem by creating a ‘deterrence’ to others. Life doesn’t work that way. Imagine if every serious crime prompted a review and action to stop it ever happening again - imagine how much further along we’d be to a more just society.

Talinx · on Oct 19, 2020

To stop the cause from happening is not always feasible. It might also be against human rights.

jonplackett · on Oct 21, 2020

Not always but a lot of crime seems like it could be avoided if more effort were put into prevention - better funding for education etc. It’s not rocket science.

Talinx · on Oct 21, 2020

I agree, there are things that can and should be done to prevent crime.

But good prevention is really hard, education being a good example. Today most people have internet access and you can educate yourself there (Wikipedia, Khan Academy, YouTube...). Access to education is not a problem - getting people to educate themselves is. Nerds do it on their own, many don't. It takes individual effort to get children's minds to learn. You need teachers who like teaching and can get children excited about the world. It's not as easy as giving everyone an iPad, funding without understanding the problems doesn't work. (I guess there are situations where funding easily solves problems depending on your country and school.)

(I'm not against funding education, I just wish it would happen in a smarter way.)

globular-toast · on Oct 18, 2020

There is sometimes a single cause, but as the parent comment pointed out, that should never be the case and is a flaw in the system. We are gradually working towards single errors being correctable, but we're not there yet.

On the railways in Britain the failures were extensively documented. Years ago it was possible for a single failure to cause a loss. But over the years the systems have been patched and if you look at more recent incidents it is always a multitude of factors aligning that cause the loss. Sometimes it's amazing how precisely these individual elements have to align, but it's just probability.

As demonstrated by the article here, we are still in the stage where single failures can cause a loss. But it's a bit different because there is no single universal body regulating every computer system.

janoc · on Oct 18, 2020

There is almost never a single cause. If a single cause can trigger a disaster, then there is another cause by definition - poor system design.

E.g. in the article's case it is clear that there is some sort of procedural deficiency there that allows the configuration variables to be set wrong and thus cause a connection to the wrong database.

Another one is that the function that has directly caused the data loss DOES NOT CHECK for this.

Yet another WTF is that if that code is meant to ever run on a development system, why is it in a production codebase in the first place?

And the worst bit? They throw arms up in the air, unable to identify the reason why this has happened. So they are leaving the possibility open to another similar mistake happening in the future, even though they have removed the offending code.

Oh and the fact that they don't have backups except for those of the hosting provider (which really shouldn't be relied on except as the last hail Mary solution!) is telling.

That's not a robust system design, especially if they are hosting customers' data.

JackFr · on Oct 18, 2020

This should be a teachable moment with respect to their culture. Throwing up their hands without an understanding of what happened is unacceptable — if something that is believed impossible happens, it is important to know where your mental model failed. Otherwise you may make things worse by ‘remediating’ the wrong thing.

And while this sounds overly simplistic the simplest way this could have been avoided is enforcing production hygiene. No developers on production boxes. Ever.

pc86 · on Oct 22, 2020

I was on a cross-country United flight ca. 2015 or so and happened to sitting right in the front of first class and got to see the pilots take a bathroom break (bear with me). The process was incredibly interesting.

1. With the flight deck door closed, the three flight attendants place a drink cart between first class and the attendant area/crew bathroom. There's now a ~4.5' barrier locked against the frame of the plane.

2. The flight deck door is opened; one flight attendant goes into the flight deck while one pilot uses the restroom. The flight deck door is left open but the attendant is standing right next to it (but facing the lone pilot). The other two attendants stand against the drink cart, one facing the passengers and one facing the flight deck.

3. Pilots switch while the third attendant remains on the flight deck.

4. After both pilots are done, the flight deck door is closed and locked and the drink cart is returned to where ever they store it.

Any action by a passenger would cause the flight deck door to be closed and locked. Any action by the lone pilot would cause alarm by the flight deck attendant. Any action by the flight deck attendant would cause alarm by the other two.

quietbritishjim · on Oct 18, 2020

> Even when it was a suicidal pilot flying the plane into a mountain on purpose. Someone had to supervise him (there are two crew members in the cockpit for a reason), someone gave him a medical, there is automation in the cockpit that could have at least caused an alarm, etc.

There was indeed a suicidal pilot that flew into a mountain, I'm not sure if you were deliberately referencing that specific time. In that case he was alone in the cabin – this would have happened briefly but he was able to lock the cabin door before anyone re-entered, and the lock cannot be opened by anyone from the other side in order to avoid September 11th-type situations. It only locks for a brief period but it can be reapplied from the pilot side before it expires an indefinite number of times.

I'm not saying that we can put that one down purely to human action, just that (to be pedantic) he wasn't being supervised by anyone, and there were already any number of alarms going off (and the frantic copilots on the other side of the door were well aware of them).

t0mas88 · on Oct 18, 2020

And as a result of that incident the procedures have changed, now a cabin crew member (or relief pilot in long haul ops) joins the other pilot in the cockpit if one has to go to the bathroom.

A similar procedure already exists for controlled rest in oceanic cruise flight at certain times, using the cabin crew to ensure the remaining pilot was checked to be awake every 20 minutes.

janoc · on Oct 18, 2020

I was referring specifically to the Germanwings incident.

That pilot shouldn't have been in the cockpit to begin with - his eyesight was failing, he had mental problems (has been medically treated for suicidal tendencies), etc. This was not discovered nor identified, due to deficiencies in the system (doctors didn't have the duty to report this, he withheld the information from his employer, etc.)

The issue with the door was only the last element of the chain.

There were changes as the result of this incident - the cabin crew member has to be in the cockpit whenever one of the pilots steps out, there were changes to how the doors operate, etc.

confidantlake · on Oct 18, 2020

The change to require a cabin crew member in the cockpit is a good one.

Not really sure what you can about the suicidal tendencies. If you make pilots report medical treatment for suicidal tendencies, they aren't going to seek treatment for suicidal tendencies.

janoc · on Oct 19, 2020

That should have been reported by the doctor. Lubitz (the pilot) was denied an American license for this before - and somehow it wasn't caught/discovered when he got the Lufthansa/Germanwings job. Or nobody has followed up on it.

On the day of the crash he was not supposed to be on the plane at all - a paper from the doctors was found at his place after the crash declaring him unfit for duty. He kept it from his employer and it wasn't reported by the doctors neither (they didn't have the duty to do so), so the airline had no idea. Making a few of the holes in the cheese align nicely.

Pilots have the obligation to report when they are unfit for duty already, (no matter what the reason, being treated for a psychiatric problem certainly applies, though).

What was/is missing is the obligation of doctors to report such important issue to the employer when the crewman is unfit. It could be argued that it would be an invasion of privacy but there are precedents for this - e.g. failed medicals are routinely being reported to the authorities (not just for pilots - also for car drivers, gun holders, etc. where the corresponding licenses are then suspended), as are discoveries of e.g. child abuse.

vlovich123 · on Oct 18, 2020

Any examinations of whether or not the job itself has properties that might cause the medical issues?

odyssey7 · on Oct 18, 2020

My impression of the Swiss cheese model is that it's used to take liability from the software vendor and (optionally) put it back on the software purchaser. Sure, there was a software error, but really, Mr. Customer, if this was so important, then you really should have been paying more attention and noticed the data issues sooner.

janoc · on Oct 19, 2020

Nonsense.

Software vendor cannot be held responsible for errors committed by the user.

That would be blaming a parachute maker for the death of the guy who jumped out of a plane without a parachute or with one rigged wrong despite the explicit instructions (or industrial best practices) telling him not to do so.

Certainly vendors need to make sure that their product is fit for the purpose and doesn't contain glaring design problems (e.g. the infamous Therac-25 scandal) but that alone is not enough to prevent a disaster.

For example, in the cited article there was no "software error". The data haven't been lost because of a bug in some 3rd party code.

Data security and safety is always a process, there is no magic bullet you can buy and be done with it, with no effort of your own.

The swiss cheese model shows this - some of the cheese layers are safeguards put in place by the vendor, the others are there for you to put in place (e.g. the various best practices, safe work procedures, backups, etc.) If you don't, well, you are making the holes easier to align because there are now fewer safety layers between you and the disaster. By your own choice.

gonzo41 · on Oct 18, 2020

You can't outsource risk.

oblio · on Oct 18, 2020

The user? Start a discussion about using better programming language and you'll see people, even here, blaming the developer.

The common example is C: "C is a sharp tool, but with a sufficiently smart, careful and experienced developer it does what you want (you're holding it wrong").

Developers still do this to each other.

m463 · on Oct 18, 2020

That reminds me of the time during the rise of the PC when windows would do something wrong, from a confusing interface all the way up to a blue screen of death.

What happened is that users started blaming themselves for what was going wrong, or start thinking they needed a new PC because problems would become more frequent.

From the perspective of a software guy, it was obvious that windows was the culprit but people would assign blame elsewhere and frequently point the finger at themselves.

so yes - an FAA investigation would end up unraveling the nonsense and point to windows.

That said, aviation level of safety is reliable and dependable and few single points of failure and... there are no private kit jets darnit!

There is a continuum from nothing changes & everything works to everything changes & nothing works. You have to choose the appropriate place on the dial for the task. Sounds like this is a one-man band.

watwut · on Oct 18, 2020

Yeah, but "pilot was drinking alcohol" would be considerate issue, would lead to fired pilot and would lead to more alcohol testing.

I understand what you are taking about, but aviation has also strong expectations on pilots.

janoc · on Oct 18, 2020

Of course it would. But then there should be a process that identifies such pilot before they even get on the plane, there are two crew in the cockpit, so if one crewman does something unsafe or inappropriate, the other person is there to notice it, call it out and, in the extreme case, to take control of the plane.

Also, if the guy or gal has alcohol problems, it would likely be visible on their flying performance over time, it should be noticed during the periodic medicals, etc.

So while a drunk pilot could be the immediate cause of a crash, it is not the only one. If any of those other things I have mentioned functioned as designed (or were in place to start with - not all flying is airline flying!), the accident wouldn't have happened.

If you focus only on the "drunk pilot, case closed", you will never identify deficiencies you may have elsewhere and which have contributed to the problem.

watwut · on Oct 18, 2020

Note how none of those is in article. Article is like "Could we blame alcohol? Oh no, surely not.".

t0mas88 · on Oct 18, 2020

Believe it or not, even "pilot is an alcoholic" is still part of the no blame culture in aviation. As long as the pilot reports himself he'll not be fired for that. Look up the HIMS program to read more details.

user5994461 · on Oct 18, 2020

You can google and find cases of US pilots getting fired and sentenced to a year in prison for flying intoxicated.

Maybe they don't get fired if they report themselves unable to fly beforehand but I wouldn't quite call that a no blame culture.

neillyons · on Oct 18, 2020

This sounds quite interesting. Any books you could recommend on the "pilot error" topic.

miketery · on Oct 18, 2020

When I was getting my pilots license I used to read accident reports from Canada's Transportation Safety Board [1]. I'm sure the NTSB (America's version) has similar calibre reports [2].

There is also Cockpit Resource Management [3] which addresses the human factor in great detail (how people work with each other, and how prepared are people).

In general what you learn from reading these things is that its rarely one big error or issue - but many small things leading to the failure event.

1 - https://www.tsb.gc.ca/eng/rapports-reports/aviation/index.ht...

2 - https://www.ntsb.gov/investigations/AccidentReports/Pages/Ac...

3 - https://en.wikipedia.org/wiki/Crew_resource_management

masklinn · on Oct 18, 2020

The old "they write the right stuff" essay on the On-Board Shuttle Group also talked about this mindset of errors getting through the process as being first and foremost a problem with the process to be examined in detail and fixed.

Jugurtha · on Oct 18, 2020

"The Checklist Manifesto", by Atul Gawande, dives into how they looked at other sectors such as aviation to improve healthcare systems, reduce infections, etc. Interesting book.

neillyons · on Oct 18, 2020

Just bought the audiobook. About to give it a listen now. Thanks.

permarad · on Oct 18, 2020

The Design of Everyday Things by Donald A. Norman. He covers pilot error a lot in this book in how it falls back on design and usability. Very interesting read.

jsmith45 · on Oct 19, 2020

Not sure about books, but the NTSB generally seems to adopt the philosophy of not trying to assign blame, but instead to figure out what happened, and try to determine what can be changed to prevent this same issue from happening again.

Of course trying to assign blame is human nature, so the reports are not always completely neutral. When I read the actual NTSB report for Sullenburger's "Miracle on the Hudson", I was forced to conclude that while there were some things that the pilots could in theory have done better, given the pilots training and documented procedures, they honestly did better than could reasonably be expected. I am nearly certain that some of the wording in the report was carefully chosen to lead one to this conclusion, despite still pointing out the places where the pilots actions were suboptimal (and thus appearing facially neutral).

The "what can we do to avoid this ever happing again?" attitude applies to real air transit accident reports. Sadly many general aviation accident reports really do just become "pilot error".

benkelaar · on Oct 18, 2020

This is also one of the core tenets of SRE. The chapter on blameless postmortems is quite nice: https://landing.google.com/sre/sre-book/chapters/postmortem-...

jnsaff2 · on Oct 18, 2020

Anything by Sidney Dekker. https://sidneydekker.com/books/ I would start by The Field Guide to Unterstanding 'Human Error'. It's very approachable and gives you a solid understanding of the field.

janoc · on Oct 18, 2020

Not sure about books but look up the Swiss cheese model. It is widely used approach and not only in aviation. Most industrial accidents and incidents are investigated with this in mind.

AdrianB1 · on Oct 18, 2020

As a GA pilot I know people that had accidents with planes and I know that in most cases what is the the official report and what really happened are not the same, so any book would have to rely on inaccurate or unreal data. For airliners it is easy because there are flight recorders, for GA it is still a bit of Wild West.

t0mas88 · on Oct 18, 2020

It's part of the Human Performance subject in getting an ATPL (airline license), it was one of the subjects that I didn't hate as much when studying. You can probably just buy the book on Amazon, they're quite accessible.

ENOTTY · on Oct 18, 2020

The idea that multiple failures must occur for catastrophic failure is found in certain parts of the computing community. https://en.wikipedia.org/wiki/Defense_in_depth_(computing)

suzakus · on Oct 18, 2020

It's a piece of software for scoreboards. Not the Therac-25, nor an airplane.

qz2 · on Oct 18, 2020

This time.

Some days it’s just an on line community that gets burned to the ground.

Other days it’s just a service tied into hundred of small businesses that gets burned to the ground.

Other says it’s massive financial platform getting burned to the ground.

I’m responsible for the latter but the former two have had a much larger impact for many people when they occur. Trivialising the lax administrative discipline because a product isn’t deemed important is a slippery slope.

We need to start building competence in to what we do regardless of what it is rather than run on apologies because it’s cheaper.

wizzwizz4 · on Oct 18, 2020

Prismo is an example of the first: https://fediverse.blog/~/Prismo/on-prismo-data-loss

The project never recovered.

ozim · on Oct 18, 2020

Parent is not advocating about going as strict with procedures as operating an airplane. Post is saying about "a bit of aviation safety culture" then it highlights a specific part that would be useful.

Safety culture element highlighted is: not blaming a single person but finding out how to prevent accident that happened from happening again. Which is reasonable because you don't want to impose some strict rules that are expensive up front. This way you just introduce measure to prevent same thing in the future, in the context of your project.

suzakus · on Oct 18, 2020

Ah, I misread. That's what I get for commenting late at night :(

Thanks for clarifying!

t0mas88 · on Oct 18, 2020

It isn't about the importance of this one database, it's about the cultural issue in most of the sector that the parent comment was pointing out: we far too often blame the user/operator calling them stupid, while every human makes mistake, it's inevitable.

dmitriid · on Oct 18, 2020

The Russian joke about investigating is "Punish the innocent, award the uninvolved"

ganafagol · on Oct 18, 2020

It's good to have a post mortem. But this was not actually a post mortem. They still don't know how it could happen. Essentially, how can they write "We’re too tired to figure it out right now." and right after attempt to answer "What have we learned? Why won’t this happen again?" Well obviously you have not learned the key lesson yet since you don't know what it is! And how can you even dream of claiming to guarantee that it won't happen again before you know the root cause?

Get some sleep, do a thorough investigation, and the results of that are the post mortem that we would like published and where you learn from.

Publishing some premature thoughts without actual insight is not helping anybody. It will just invite the hate that you are seeing in this thread.

ordu · on Oct 18, 2020

> I'm appalled at the way some people here receive an honest postmortem of a human fuck-up. The top 3 comments, as I write this, can be summarized as "no, it's your fault and you're stupid for making the fault".

It seems that people annoyed mostly by "complexity gremlins". They are so annoyed, that they miss previous sentence "we’re too tired to figure it out right now." Guys fucked up their system, they restored it the best they could, they tried to figure out what happened, but failed. So they decided to do PR right now, to explain what they know, and to continue the investigation later.

But people see just "complexity gremlins". The lesson learned is do not try any humor in a postmortem. Be as serious, grave, and dull as you can.

rawgabbit · on Oct 18, 2020

For me, this is an example of DevOps being carried too far.

What is to stop developers for checking into Github "drop database; drop table; alter index; create table; create database; alter permission;"? They are automating environment builds and so that is more efficient right? In my career, I have seen a Fortune 100 company's core system down and out for a week because of hubris like this. In large companies, data flows downstream from a core system. When you have to restore from backup, that cascades into restores in all the child systems.

Similarly, I once had to convince a Microsoft Evangelist who was hired into my company, not to redeploy our production database, every-time we had a production deployment. He was a pure developer and did not see any problems of dropping the database, recreating the database, and re-inserting all the data. I argued that a) this would take 10+ hours b) the production database has data going back many years and that the schema/keys/rules/triggers have evolved during that time-- meaning that many of the inserts would fail because they didn't meet the current schema. He was unconvinced but luckily my bosses overruled him.

My bosses were business types and understood accounting. In accounting, once you "post" a transaction to the ledger that becomes permanent. If you need to correct that transaction, then you create a new one that "credits" or corrects the entry. You don't take out the eraser.

bromuro · on Oct 18, 2020

I think you should wait 10+ hours to read different kind of comments on HN.

For example, if i open the comments about a “14 hours ago” post, I usually see a top comment about other comments (like yours).

I then feel so out of the loop because i don’t see the “commenters” your are referring too - so the thread that follows seem off topic to me.

caspii · on Oct 18, 2020

Thanks

qz2 · on Oct 18, 2020

I disagree.

Culturally speaking we like to pat people on their back when they do something stupid and comfort them. But most of the time this isn’t productive because it doesn’t instil the requisite fear required when working out what decision to make.

What happens is we have growing complacency and disassociation from consequences.

Do you press the button on something potentially destructive because your are confident it is ok through analysis, good design and testing or confidence it is ok through trite complacency?

The industry is mostly the latter and it has to stop. And the first thing is calling bad processes, bad software and stupidity out for what it is.

Honestly these guys did good but most will try and hide this sort of fuck up or explain it away with weasel words.

jurre · on Oct 18, 2020

You should have zero fear instilled when pressing any button. The system or process has failed if a single person pressing a button can bring something down unintended. Fix the system/process, don’t “instill fear” onto the person, it’s toxic, plus now you have to make sure any new person on boarded has “fear instilled”, and that’s just hugely unproductive

qz2 · on Oct 18, 2020

That’s precisely my point. A lot of people have no fear because they’re complacent or ignorant rather than the button is well engineered.

But to get there you need to fear the bad outcomes.

ddelt · on Oct 18, 2020

I’m sorry, but this really hasn’t been my experience at all in web technology or managing on-prem systems either.

I used to be extremely fearful of making mistakes, and used to work in a zero-tolerance fear culture. My experience and the experience of my teammates on the DevOps team? We did everything manually and slowly because we couldn’t see past our own fear to think creatively on how to automate away errors. And yes, we still made errors doing everything carefully, with a second set of eyes, late night deployments, backups taken, etc.

Once our leadership at the time changed to espouse a culture of learning from a mistake and not being afraid to make one as long as you can recover and improve something, we reduced a lot of risk and actually automated away a lot of errors we typically made which were caused initially by fear culture.

Just my two cents.

qz2 · on Oct 18, 2020

I’m not talking about fear culture. It’s a more personal thing. Call it risk management culture if that helps which is the inverse.

Manual is wrong for a start. That would increase the probability of making a mistake and thus increase risk for example. The mitigations are automation, review and testing.

I agree with you. Perhaps fear was the wrong term. I treat it as my personal guide to how uneasy i feel about something on the quality front.

inglor_cz · on Oct 18, 2020

I recalled Akimov pressing the AZ-5 button in Chernobyl...

ClumsyPilot · on Oct 18, 2020

"fear required when working out what decision to make"

People like you keep making the same mistake, creating companies/organisations/industries/societies that run on fear of failure. We've tried it a thousand times, and it never works.

You can't solve crime by making all punishments hearsh death, we've tried that in 1700 in Britain and crimerate was sky high.

This culture gave us disasters in USSR and famine in China.

The only thing that can solve this problem is structural change.

qz2 · on Oct 18, 2020

I think my point is probably being misunderstood and that is my fault for explaining it poorly. See I fucked up :)

The fear I speak of is a personal barrier which is lacking in a lot of people. They can sleep quite happily at night knowing they did a shitty job and it's going to explode down the line. It's not their problem. They don't care.

I can't do that. Even if there are no direct consequences for me.

This is not because I am threatened but because I have some personal and professionals standards.

mrmonkeyman · on Oct 18, 2020

HN likes to downplay this, apparantly, but not everything can be boiled down to bureaucracy.

Yes, medical professionals use checklists. They also have a harsh and very unforgiving culture that fosters craftsmanship and values professionalism above all else. You see this in other high-stakes professions too.

You cannot just take the checklist and ignore the relentless focus on quality, the feelings of personal failure and countless hours honing and caring for the craft.

Developers are notorious for being lazy AF, so it's not hard to explain our obsession with "just fix the system". It's a required but not sufficient condition.

ClumsyPilot · on Oct 18, 2020

'The system' includes the attitudes of developers and people that pay them.

Everyone takes job of a medical proffesional seriously, from the education to the hospitals that enmloys them to the lawmakers to the patients.

When you pick a surgeon, you avoid the ones that killed people. Do you avoid developers that introduce bugs? We don't even keep track of that!

You can have the license taken away as a surgeon, I've never heard of anyone becoming unemployable as a developer.

You are not gonna get an equivalent outcome even if tomorrow all developers show up to work with an attitude of a heart surgeon.

However if suddenly all data loss and data breaches would result in massive compensation, and if slow and buggy software resulted in real lawsuits, you would see the results very quickly.

Basically same issues as in trading securities: no accountability for either developers or the decision makers.

mr_toad · on Oct 19, 2020

> medical professionals

operate in an environment where they don’t fully understand the systems they’re working with (human biology still has many unknowns), and many mistakes are irreversible.

If you look at the worst performing IT departments, they suffer from the same problems: they don’t fully understand how their systems work, and they lack easy ways to reverse changes.

johnisgood · on Oct 18, 2020

> The only thing that can solve this problem is structural change.

Well, care to elaborate on this? What do we have to change, and to what end?

MaxBarraclough · on Oct 18, 2020

You're speaking to the mistake. The comment you're replying to is speaking to the write-up analysing the mistake.

Blog posts analysing real-world mistakes should not be met with beratement.

corobo · on Oct 18, 2020

Most will hide it away because being truthful will hurt current business or future career prospects because people like yourself exist who want everyone shitting themselves at the prospect of being honest.

In a blame free environment you find the underlying issue and fix it. In a blame full environment you cover up the mistake to avoid being fired and some other person does it again later down the line

qz2 · on Oct 18, 2020

No.

There’s a third option where people accept responsibility and are rewarded for that rather than hide from it one way or another.

I have a modicum of respect for people who do that. I don’t for people who shovel it under a rug or point a finger which are the points you are referring to. I’ve been in both environments and neither end up with a positive outcome.

If I fuck up I’m the first person to put my hand up. Call it responsibility culture.

rawoke083600 · on Oct 18, 2020

I think you missing point, I love the idea about "bringing aviation methodology" to lower error/f-up-rates" for the software industry.

No one is not saying don't take responsibility, they are saying - as I understood it:

Have a "systematic-approach" to the problem, the current system for preventing "drunk pilots or the wiping of production db's are not sufficient" - improve the system ! ! All the "taking responsibilities and "falling on one's own sword" won't improve the process for the future.

If we take the example of the Space-Industry where having 3x Backups Systems are common (like life support)

It seems some people's view in the comments stream is:

"No bugger that, the life-support systems engineers and co should just 'take responsibility' and produce flawless products. No need for this 3 x backups systems"

The "system" approach is that there is x-rates of failures by having 2 backups we have now reduce the possibility of error by y amount.

Or in the case of production-dbs:

If I were the CEO and the following happens:

CTO-1: "Worker A, has deleted the production DB, I've scolded him, he is sorry and got dock a months pay and is feeling quite bad and has taken full responsibility for his stupid action this probably won't happen again !"

VS

CTO-2: "Worker A, has deleted the production DB, We Identified that our process/system for allowing dev-machines to access production db's was a terrible idea and oversight, we now have measures abc in place to prevent that in the future"

I'd go with CTO-2 EVERY day of the week !

qz2 · on Oct 18, 2020

Yes. CTO-2 is my approach. As the CTO I fucked up because I didn't have that process in place to start with. To buck stops at me.

CTO-2 also has the responsibility of making sure that everyone is educated on this issue and can communicate those issues and worries (fears) to his/her level effectively because prevention is better than cure. Which is my other point.

corobo · on Oct 18, 2020

See that leading with a "No." there

That's what we're talking about. I hope you don't have direct reports.

Next time be honest "Just shut the conversation down, everyone's a dumbass, I'm right, you're dumb" it'll be quicker than all this back and forth actually trying to get to a stable middle ground :)

qz2 · on Oct 18, 2020

That's a tad ironic is it not?

All I am calling for is people to take responsibility.

rawoke083600 · on Oct 18, 2020

What is the future value in that ? From a system and reliability point of view ? Genuine question - not trying to be a dk

qz2 · on Oct 18, 2020

The point is that if you take responsibility then you're taking pride in your work, are invested in it and willing to invest in self-improvement and introspection rather than doing the minimum to get to a minimum viable solution. The outcome of this is an increase in quality and a decrease in risk.

rawoke083600 · on Oct 18, 2020

Wow - "...invest in self-improvement and introspection.."

I would hate for that to be our system reliability improvement methodology.

Ok fine now I'm being slightly a "dk" but really ?

qz2 · on Oct 18, 2020

Well you can also apply a QMS if you want but all that does is generate paperwork full of accepted risks...

hightekredneck · on Oct 23, 2020

Exactly right.

The "comfort" will come from taking responsibility and owning and correcting the problem such that you have confidence it won't happen again.

Platitudes to make someone feel better without action helps nobody.

The fear is a valuable feedback mechanism and shouldn't been ignored. It's there to remind you of the potential consequences of a careless fuckup.

Lots here misunderstood this I think.. clearly the point is not to berate people for making mistakes or to foster a "fear culture" insofar as fear of personal attack but rather to not ignore the internal/personal fear of fucking up because you give a shit.