Faster emergency rollout by skipping entire canary process #1681

Open
kangyawong-grabtaxi opened this issue Jul 15, 2024 · 0 comments
Describe the feature

What problem are you trying to solve?

Currently, we can use skipAnalysis to bypass most of the canary steps. However, in some cases, the canary process can still be slow, especially when the time to pass the probe checks (time to ready) is longer than usual.

For instance, I have a few apps that need more than 10 minutes to pass the probe checks, and it's not feasible to refactor them in the foreseeable future.

While skipping the probe check might seem risky, it can be particularly useful for emergency roll-forwards when users are confident that the target spec works and they want to replace the buggy pods as fast as possible. Therefore, it would be beneficial to have a way to skip the entire canary process during emergencies without directly patching the primary objects.
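For context, here is a minimal sketch of where the existing skipAnalysis setting mentioned above lives in a Canary resource. The resource names and analysis values are placeholders, not taken from any real deployment.

```yaml
# Sketch only: names and analysis values are placeholders.
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: my-app            # placeholder
  namespace: default      # placeholder
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app          # placeholder
  # Existing field: promote without running the canary analysis.
  # As described above, the canary pods must still pass their
  # probe checks before the rollout proceeds.
  skipAnalysis: true
  analysis:
    interval: 1m
    threshold: 5
```

Even with this set, the time to ready of the canary pods remains on the critical path, which is the delay this issue is about.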

Proposed solution

I wonder if it's possible to add a new boolean flag to the Canary CRD that skips both the canary rollout status check and the analysis. Here's a POC

A few drawbacks:

  • It also skips the compatibility check between the ConfigMaps/Secrets (using configMapKeyRef) and the application: existing primary pods may break if they react to changes in the mounted config files and the new config is not compatible
  • In a way, the proposed flag behaves like a plain K8s rolling update, just without having to uninstall Flagger to skip the canary
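To make the proposal concrete, the flag could sit next to skipAnalysis in the Canary spec. This is only an illustration: the field name `skipCanary` is hypothetical. It is not taken from the POC and does not exist in Flagger's current API.

```yaml
# Illustration only: `skipCanary` is a HYPOTHETICAL field name,
# not part of Flagger's API and not taken from the POC.
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: my-app            # placeholder
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app          # placeholder
  # Hypothetical emergency switch: skip both the canary rollout
  # status check and the analysis, and promote the new spec to the
  # primary right away.
  skipCanary: true
```

A flag like this would presumably be turned on only for the emergency roll-forward and switched off again afterwards, so that normal deployments keep going through the full canary flow.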

Any alternatives you've considered?

  1. Uninstalling or disabling Flagger might work, but it seems excessive for emergency deployments.
  2. A script to manually patch the primary objects: somewhat hacky and error-prone IMO, and I would rather keep Flagger in charge of the normal deployment flow.
  3. Relying on canaryReadyThreshold: the check is still blocked by the canary's rolling update (while the old ReplicaSet is being replaced by the new one).

I'd be happy to discuss any other workarounds or suggestions 🙇‍♂️
