
After isolating the bug, we attempted to roll back to a previous version of the routing mesh code. While the rollback solved the initial problem, there was an unexpected incompatibility between the routing mesh and our caching service.

To me, it seems like they just needed to apply the "hot patch". Instead they panicked(?) and did a lot of unnecessary version control gymnastics, which delayed the bug fix.



I've written mostly client code, and watched server action from the sidelines, but jumping straight to the hotfix only seems obvious in retrospect to me. Rolling back to a known-good state is the safe approach – it just didn't work in this case because of a surprise incompatibility with another system.

If you jump straight to the hotfix, you're basically enlisting the entirety of your userbase to join you in a round of QA, which could be sub-optimal if your hotfix ends up causing some other unintended consequence.

Right?


Rolling back to a known-good state is the safe approach

Absolutely.

However, the rollback should be atomic, which means all pieces of the infrastructure/code should be rolled back to a known-good state.

When I said "gymnastics", I meant rolling back one piece of code, only to find it incompatible with the other pieces.

I do not intend to judge them, knowing it's difficult not to panic in an emergency. But not working the effects out on paper (or not knowing which versions of the software components are inter-compatible) before actually doing it in code looks pretty novice to me for a company of Heroku's scale.

I really hope this is not an unfair criticism. (Handling emergencies is difficult.)
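
The "on paper" check described here could be as small as a lookup against a recorded table of version pairs that were tested together. A minimal sketch, assuming such a table exists at all; the component names, version numbers, and the `COMPATIBLE` table are hypothetical and not taken from Heroku's actual setup:

```python
# Hypothetical record of version pairs that were verified to work together.
# All names and version numbers are illustrative assumptions.
COMPATIBLE = {
    ("routing-mesh", "2.1", "cache", "5.0"),
    ("routing-mesh", "2.0", "cache", "4.9"),
}

def is_verified(comp_a, ver_a, comp_b, ver_b):
    """True if this pair of versions was recorded as tested together."""
    return ((comp_a, ver_a, comp_b, ver_b) in COMPATIBLE
            or (comp_b, ver_b, comp_a, ver_a) in COMPATIBLE)

def rollback_conflicts(component, target_version, deployed):
    """List deployed components not verified against the rollback target."""
    return [
        (other, ver)
        for other, ver in deployed.items()
        if other != component
        and not is_verified(component, target_version, other, ver)
    ]

deployed = {"routing-mesh": "2.1", "cache": "5.0"}
print(rollback_conflicts("routing-mesh", "2.0", deployed))
```

With this made-up data, rolling the routing mesh back to 2.0 flags the currently deployed cache 5.0 as unverified before anything is touched. The sketch is only as good as the table: a pair that was never recorded shows up as a conflict, and a pair recorded wrongly does not.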


The trouble is that they have messaging servers, a routing mesh, and caching servers, which are all loosely coupled and deployed on separate boxes. They could take down all the dozens of pieces of their infrastructure and roll them all back to wherever they were on that previous date, but this is not better, for several reasons:

1) It takes much longer than just rebooting the isolated service. Can you imagine Google shutting down every one of their millions of boxes, rolling them back to a previous state and spinning them up again?

2) They'd still be at risk of incompatibilities with their databases, etc. The problem with unexpected incompatibilities is that they're unexpected.


All I am trying to say is that it only makes sense to do an effect analysis of your changes before actually making them.

I fail to see why they needed to revert a piece of code, and only then realize, OMG... this version of the code does not fit well with the rest of the architecture, now change it back.

1.) I do not expect Google (or even a small shop, like my place) to revert any piece of code which is not affected.

I, HOWEVER, expect to know EXACTLY what changes I am making, and what to EXPECT after the changes.

(It should not be black magic, for historical code).

2.) I fail to understand this: why should this be the case for older code? I can understand some tricky/edge/minor cases, but whether the architecture/database etc. is compatible or not (major compatibility) should be possible to calculate BEFORE making the changes.

I hope I am not over-trivializing the issue, but I still cannot get my head around the approach.


To me, it sounds like they rolled one software component back to a known working version.

That's not unreasonable. And doesn't sound like "a lot of unnecessary version control gymnastics" to me.

Of course, the problem was that the version they rolled back to was incompatible with the current version of their other services.


How is rolling back one software component, without knowing/calculating whether it will "play nice" with the rest of the architecture/services, acceptable or reasonable?

Am I missing something here? Please let me know.


Since what they had was already broken and customers were down, they probably didn't have the hours it would take to correctly verify the older version of the broken component would work with the rest of their system.


They ended up reverting and then undoing the revert because they had not verified it. Also, paper calculations to be certain about the effects are not supposed to take a lot of time.



