
I agree with a lot of the statements at the top of the article, but some of them are just nonsense. This one, in particular:

> If we analyse accidents more deeply, we can get by analysing fewer accidents and still learn more.

Yeah, that's not how it works. The failure modes of your system might be concentrated in one particularly brittle area, but you really need as much breadth as you can get: the bullets are always fired at the entire plane.

> An accident happens when a system in a hazardous state encounters unfavourable environmental conditions. We cannot control environmental conditions, so we need to prevent hazards.

I mean, I'm an R&D guy, so my experience is biased, but... sometimes the system is just broken, and no amount of saying "the system is in a hazardous state" can paper over the fact that you shipped (or, best case, stress-tested) trash. You absolutely have to run these cases through the failure analysis pipeline; there's no way around that. But the analysis flow looks a bit different for things that should have worked versus things that could never have worked. And, yes, it will roll up on management, but... still.
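
To make that distinction concrete, here's a toy sketch (my framing, not the article's; every name in it is invented for illustration) of the quoted model, where an accident needs both a hazard and bad environmental luck, next to the "could never have worked" case:

    # Toy model, in Python, of "accident = hazardous state +
    # unfavourable environment". All names are hypothetical.
    from dataclasses import dataclass
    import random

    @dataclass
    class System:
        hazardous: bool  # design allows a dangerous state
        broken: bool     # the "could never have worked" category

    def environment_unfavourable() -> bool:
        # We can't control this; at best we estimate its likelihood.
        return random.random() < 0.1

    def accident_occurs(system: System) -> bool:
        if system.broken:
            # No environmental trigger needed: it never could work.
            return True
        # The article's framing: hazard AND environmental bad luck.
        return system.hazardous and environment_unfavourable()

The point of separating the two branches is exactly the parent's complaint: the second branch is the article's model, but the first branch exists too, and calling it a "hazardous state" adds nothing.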



> you really need as much breadth as you can get

Sure, more is always better. Practically, though, depth and breadth trade off against each other. In my experience, many problems that look dissimilar after a shallow analysis turn out to be caused by the same thing when analysed in depth. In that case, it is more economical to analyse fewer incidents in greater depth and actually find their common factors than to make a shallow pass over many incidents and keep papering over symptoms of the undiscovered deeper problem.
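
As a toy illustration of that economics (my own sketch; the incident data is invented): if superficially different incidents share one deep cause, a deep pass on a few of them retires more failure modes per unit of effort than a shallow pass over all of them:

    # Hypothetical incident records: (surface symptom, deep root cause).
    incidents = [
        ("timeout on checkout",  "clock skew"),
        ("stale cache entries",  "clock skew"),
        ("double-charged order", "clock skew"),
        ("OOM in worker",        "unbounded queue"),
        ("slow batch job",       "unbounded queue"),
    ]

    # Shallow pass: five distinct symptoms look like five separate fixes.
    print(len({symptom for symptom, _ in incidents}))  # -> 5

    # Deep pass: two root causes retire all five symptoms.
    print(len({cause for _, cause in incidents}))      # -> 2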


I guess that's where our experience differs. I am 100% on board with chasing as many incidents to true root cause as possible, and I agree that doing so can be extremely helpful in ways that might not be easy to foresee.

But my experience is also that you cannot ignore anything. Even the little stuff. The number of difficult system-level bugs I have resolved by remembering "you know, two weeks ago, it briefly did this weird thing that really shouldn't have been possible, but this might be related to that if only..." is crazy. It's been a superpower for me through the years.

However, I mostly work on hardware, and hardware's complexity envelope is straight-up different from software's, so that might explain some of the difference in our perspectives. Hardware never misbehaves truly at random: all of its bad behavior has some cause one can reasonably hope to ascertain, and that cause isn't hiding at a level a hardware engineer can't reach. Software, though, carries enough state, including state from other levels of the stack, that I would not make the same claim. So the fault-chasing priorities aren't quite the same.



