VCL not updated on specific circumstances #138

Open · robert7 opened this issue Dec 13, 2022 · 4 comments
Labels: bug (Something isn't working), help-wanted (Extra attention is needed)

Comments

@robert7 (Contributor) commented Dec 13, 2022

Describe the bug
For a while now I have suspected that the VCL in the cluster is sometimes not updated. Now I was able to reproduce it (or rather: spot and document an occurrence).
It seems to happen as follows:

  • the watch loop picks up the change (after the config map containing the VCL has been updated in the cluster and propagated to the running pod via the mapped volume)
  • the change is stored in the controller's internal data structure
  • rebuildConfig() in watch.go tries to load the rendered VCL into Varnish via the CLI
  • something fails and rebuildConfig() returns an error; the error is logged (as a warning)
  • because the new VCL is already stored in the controller's data structure, the update is never retried; apart from the log message the failure is silently ignored, and the cluster pod keeps running with the old config (sketched below)
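For illustration, a minimal Go sketch of this failure mode as I understand it. This is not the actual kube-httpcache code: rebuildConfig() is the function named above, but the controller type, the channels and the loop shape are simplifications/assumptions of mine.

```go
package sketch

import (
	"context"
	"errors"
)

// Simplified stand-in for the controller state; not the real kube-httpcache
// types, just enough to illustrate the failure mode.
type controller struct {
	vclTemplate string // last template picked up by the watch loop
}

// rebuildConfig stands in for rendering the template and loading the VCL
// into Varnish via the CLI; it can fail transiently, e.g. when the Varnish
// child is terminated/restarted mid-update.
func (c *controller) rebuildConfig(ctx context.Context) error {
	return errors.New("varnish CLI rejected the VCL update")
}

func (c *controller) watchLoop(ctx context.Context, updates <-chan string, errs chan<- error) {
	for {
		select {
		case tmpl := <-updates:
			// The new template is stored *before* Varnish has accepted it...
			c.vclTemplate = tmpl

			// ...so when loading fails, the error is only reported (and logged
			// as a warning downstream) and never retried: the pod keeps
			// serving traffic with the old VCL.
			if err := c.rebuildConfig(ctx); err != nil {
				errs <- err
			}
		case <-ctx.Done():
			return
		}
	}
}
```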

Screenshots are attached (see the error example further below).

To Reproduce
Unfortunately, I cannot reproduce it reliably, as it happens semi-randomly. It could probably be reproduced by putting Varnish under enough stress that the "varnish child is terminated/dies during VCL update".

Expected behavior
The VCL is updated (or the update is at least retried a few times on failure).

Environment:

  • Kubernetes version: 1.24
  • kube-httpcache version: master branch

Configuration
-varnish-vcl-template=/etc/varnish/tmpl/frontend-tmpl.vcl # mapped from config map
-varnish-vcl-template-poll=true

Additional context
I did some further analysis. In my case, the following seems to happen:

  • the Varnish daemon receives the request via the CLI
  • the Varnish master process cannot communicate with its child process, so it terminates and restarts the child; note that this is just one specific case, and any transient error could cause the update to fail
  • the Varnish master returns a failure via the CLI (to kube-httpcache)
  • the VCL is not updated

Preliminary fix idea:

  • in watch.go, replace errors <- v.rebuildConfig(ctx)
  • with something like v.rebuildConfigWithRetry(ctx, errors)
  • the new function would call v.rebuildConfig(ctx) and retry the rebuild after an error (see the sketch below)
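A minimal sketch of what that retry wrapper could look like, continuing the hypothetical controller type from the sketch above (it additionally needs "fmt" and "time" imported). The attempt count and backoff values are arbitrary assumptions, not anything taken from the project:

```go
// Hypothetical retry wrapper around rebuildConfig: it reports every failure
// but keeps trying, so a transient Varnish hiccup does not leave the pod
// stuck on the old VCL.
func (c *controller) rebuildConfigWithRetry(ctx context.Context, errs chan<- error) error {
	const maxAttempts = 5
	backoff := time.Second

	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = c.rebuildConfig(ctx); err == nil {
			return nil // VCL loaded successfully
		}

		// Surface every failed attempt, but do not give up yet.
		errs <- fmt.Errorf("rebuilding config (attempt %d/%d): %w", attempt, maxAttempts, err)

		select {
		case <-time.After(backoff):
			backoff *= 2 // simple exponential backoff between attempts
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err // all attempts failed; the caller decides how to escalate
}
```

Returning the final error (in addition to reporting each attempt on the channel) would leave the call site in watch.go free to decide whether an exhausted retry is merely logged or escalated further.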

If I find some time, I can try to implement the fix, but unsure when.
Example of the error situation:
[screenshots attached: 2022_12_07 17_07_06, 2022_12_07 17_09_58]
The line beginning with "W" is from kube-httpcache; the rest, without a prefix, comes from the Varnish process itself.
Note that we run the pods with an increased log level (-v=7).

Note that in our case the problem could probably be mitigated by fine-tuning Varnish itself so that the child "never dies". It is also debatable whether this is a bug or expected behaviour, since the error is already logged as a warning. Still, I think some kind of retry logic would be useful to make the VCL update more resilient in cases where the failure is transient and a retry would succeed. Even though it happens only rarely, the result is that part of the cluster keeps running with the old config, which could be very dangerous.

robert7 added the bug label on Dec 13, 2022
@martin-helmich (Member) commented

Thanks for the report and the detailed analysis -- and apologies that this issue has been lying around for a while. I concur that the VCL file silently failing to update, without any feedback, is most undesirable.

Even if one were able to "fine-tun[e] varnish itself, so that the child 'never dies'" (ironically, I'm not familiar enough with Varnish to know how I'd even do that 😕), we should still account for the possibility of random failures when loading the VCL into Varnish, and either add some retry logic or have the controller bail out entirely and enforce a Pod restart.
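For completeness, a tiny sketch of that second option, reusing the hypothetical rebuildConfigWithRetry from the earlier sketch (again an assumption, not existing code): once retries are exhausted, terminating the process lets Kubernetes restart the Pod, so the controller comes back up and loads the VCL from a clean state.

```go
// Hypothetical escalation path: if even the retries cannot get the new VCL
// loaded, exit non-zero so the kubelet restarts the container (assuming
// restartPolicy: Always) instead of silently serving the stale config.
if err := c.rebuildConfigWithRetry(ctx, errs); err != nil {
	log.Fatalf("giving up on VCL update: %v; exiting so the Pod is restarted", err)
}
```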

I can also attempt to build a quick fix for this, but cannot make any promises as to when. In the mean time, PRs are of course welcome. 🙂

martin-helmich added the help-wanted label on Apr 12, 2023
@robert7 (Contributor, Author) commented Apr 13, 2023

Hi, thanks for the reply. I still have this on my "radar", but I have not had time to fix the problem yet. I will probably fix it one day (and in that case I'll open a PR).

@jfcoz (Contributor) commented Oct 8, 2024

Hello,
I have a similar problem, though I'm not sure it is for the same reason.

To reproduce:
If I scale the backend to 20 pods and then do a rollingUpdate with maxSurge: 30%, and too many pods start/stop in a short time, kube-httpcache seems to be too slow to check the pod status of each endpoint, update, and reload the VCL.

Workaround:
Set --backend-watch=false, and set the backend service name and port directly in the VCL.

@pserrano commented

(quoting @jfcoz:) Hello, I have a similar problem, though I'm not sure it is for the same reason. To reproduce: If I scale the backend to 20 pods and then do a rollingUpdate with maxSurge: 30%, and too many pods start/stop in a short time, kube-httpcache seems to be too slow to check the pod status of each endpoint, update, and reload the VCL. Workaround: Set --backend-watch=false, and set the backend service name and port directly in the VCL.

Same here: in my case, with 150 pods it is too slow as well. I applied the workaround in the meantime.
