The weakest part of CI/CD workflows is when code requires a corresponding infrastructure change, such as creating or modifying an S3 bucket, an IAM permission, or a pub/sub topic. The worst part is when such changes require careful orchestration.
I have yet to see a development workflow capable of developing and releasing something as simple as a no-downtime migration of an API to another load balancer.
Well that's because it's not a development workflow. That's a live systems change. It's like changing the tire on a moving vehicle.
If you have a very repeatable environment, you can have an entire pipeline that creates new infra from scratch (w/Terraform), builds and deploys your new app, tests it, and then points traffic at the new infra. It's like blue/green but bigger. You aren't changing the tire, you're moving from one moving vehicle to another. That works well because there's no chance of unusual problems from trying to figure out how to re-jigger things on the fly.
The former (changing the tire) is configuration-management-organized infrastructure, and the latter (switching vehicles) is immutable infrastructure.
The problem comes in with things like changing an S3 bucket or IAM role. Changing those things is like changing the highway... you can't replace the highway. You have to close down a lane of traffic, put up traffic cones, reduce the speed limit, make your changes carefully. Ideally test on a strip of test highway first.
These cloud-managed services cannot be made immutable, so you have to use configuration-management. So you have to have a change management system in place, and tightly manage the dependency between your app and the change.
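As a rough sketch of what "tightly manage the dependency" could look like, you can record which change has to land before which and let the pipeline work out the order. The change IDs and the `rollout_order` / `blocked_by` helpers below are made up for illustration; only `graphlib.TopologicalSorter` is a real standard-library class (Python 3.9+).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical change IDs. Each one maps to the changes that must land before it.
depends_on = {
    "grant-new-iam-permission": set(),
    "deploy-app-using-permission": {"grant-new-iam-permission"},
    "revoke-old-iam-permission": {"deploy-app-using-permission"},
}


def rollout_order(graph):
    """Return change IDs in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


def blocked_by(change, graph, applied):
    """Return the prerequisites of `change` that have not been applied yet."""
    return graph[change] - applied


if __name__ == "__main__":
    print(rollout_order(depends_on))
    print(blocked_by("deploy-app-using-permission", depends_on, applied=set()))
```

The point is only that the infra change and the app release are ordered explicitly (grant first, deploy, then revoke) instead of relying on someone remembering to do it.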
What exactly goes into PRs you've seen? How do they codify these steps, such that:
- Release steps can be tested while the PR is being developed. That is, they are executed somewhere that doesn't interfere with production or, ideally, with other developers' ongoing PRs, should it go bad.
- Upon merge, the production system is safely transitioned from one state to another without human intervention.
Steps, for simplicity, are:
- Deploy new version of the code behind new load balancer
- Make some requests through it to verify it's working
- Switch DNS to the new load balancer IP
- Wait for new connections to the old load balancer to die down, then remove both the old LB and the old version of the code behind it.
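To make the question concrete, here is a minimal sketch of how those steps could be codified so the same release script runs against a sandbox during PR development and against production on merge. Everything here is an assumption for illustration: `Step`, `run_release`, `lb_migration` and the methods on `env` are hypothetical names, and `env` stands for whatever adapter actually talks to your cloud provider (or to a fake).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    name: str
    run: Callable[[], None]                      # performs the change, raises on failure
    undo: Optional[Callable[[], None]] = None    # best-effort rollback, if one exists


def run_release(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order. On failure, undo completed steps in reverse order."""
    done: List[Step] = []
    log: List[str] = []
    for step in steps:
        try:
            step.run()
        except Exception:
            log.append(f"failed:{step.name}")
            for prev in reversed(done):
                if prev.undo is not None:
                    prev.undo()
                    log.append(f"undone:{prev.name}")
            return False, log
        done.append(step)
        log.append(f"ok:{step.name}")
    return True, log


def lb_migration(env) -> List[Step]:
    """The four steps above, split so each one can be rolled back or tested alone.

    `env` is a hypothetical adapter; a sandbox and production would each supply one.
    """
    return [
        Step("deploy_new", env.deploy_new, env.teardown_new),
        Step("smoke_test", env.smoke_test),
        Step("switch_dns", env.switch_dns, env.restore_dns),
        Step("drain_old", env.drain_old),
        Step("teardown_old", env.teardown_old),
    ]
```

In a PR you would call `run_release(lb_migration(sandbox_env))`, and on merge the same call with the production adapter. Note that this sketch treats every failure the same way; in practice `teardown_old` is a point of no return, and you probably don't want to roll DNS back once the old stack is half removed.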
I agree, I've started to demerge those sorts of changes. Infra code moves slower (over the long term) than app code. So having separate pipelines is IMO better.
Though on your final comment, I'd say doing something like a no-downtime API migration is actually pretty complicated if you're not doing it a lot.