You know what would be an interesting service: Chaos monkey/failure injection-as...

0vermorrow · on July 23, 2019

You mean like https://www.gremlin.com/ ?

One of the founders of Gremlin is an engineer who worked at Netflix and probably worked on Chaos Monkey as well :)

adlleong · on July 23, 2019

If I understand correctly, one of the limitations of doing application-level failure injection with Gremlin is that you need to integrate it into your code: https://www.gremlin.com/docs/application-layer/installation/

It might be interesting to combine these approaches and use a traffic split to send a percentage of traffic to Gremlin instead of integrating into the code directly.
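
To make the traffic-split idea concrete, here is a minimal sketch of a weighted split, assuming a proxy that picks a backend per request. The names `pick_backend`, `"faulty"` and `"primary"` are hypothetical and are not part of Gremlin or any real proxy's API.

```python
import random


def pick_backend(fault_percent: float, rng: random.Random) -> str:
    """Route roughly fault_percent of requests to the failure-injecting backend."""
    # Hypothetical backend names: "faulty" stands in for the failure-injecting
    # service, "primary" for the normal one.
    return "faulty" if rng.random() * 100 < fault_percent else "primary"


rng = random.Random(0)  # seeded only so the sketch is repeatable
picks = [pick_backend(10, rng) for _ in range(10_000)]
faulty_share = picks.count("faulty") / len(picks)
print(f"{faulty_share:.1%} of requests went to the faulty backend")
```

The application code never changes here; only the routing decision does, which is the appeal of doing this at the traffic-split layer.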

barbecue_sauce · on July 23, 2019

Shh, you're going to wake up the TinkerPop Gremlin guy.

jrockway · on July 24, 2019

These kinds of errors are going to happen in production, whether you inject them or let them occur naturally. Any release process that doesn't execute a perfect drain / rebalance / start new version / rebalance cycle for every (backend, proxy) combo is going to hit a timeout or broken connection between the proxy and the backend as it restarts. Should you return a 502 to your users when that happens? Nope, just retry on a different backend. Failure injection lets you test that.
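
A rough sketch of the "retry on a different backend" behavior, assuming each backend is just a callable and that connection failures surface as a `BackendError`. Both of those, along with `call_with_retry`, `restarting` and `healthy`, are made-up names for illustration, not any particular proxy's API.

```python
from typing import Callable, Optional, Sequence


class BackendError(Exception):
    """Hypothetical error for a timeout or broken connection to a backend."""


def call_with_retry(
    backends: Sequence[Callable[[str], str]],
    request: str,
    max_attempts: int = 3,
) -> str:
    """Try each backend in turn and return the first successful response."""
    last_error: Optional[BackendError] = None
    for backend in list(backends)[:max_attempts]:
        try:
            return backend(request)
        except BackendError as err:
            last_error = err  # remember the failure, move on to the next backend
    # Only now would the proxy give up and surface an error (e.g. a 502).
    raise BackendError("all backends failed") from last_error


def restarting(request: str) -> str:
    # Simulates a backend that is mid-restart during a release.
    raise BackendError("connection reset")


def healthy(request: str) -> str:
    return f"200 OK for {request}"


response = call_with_retry([restarting, healthy], "GET /")
print(response)
```

This is only safe as written for requests that can be repeated (idempotent ones); anything else needs more care before retrying.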