Improve storage model of Caddy's pods in K8s #737

Open
angonz opened this issue Oct 20, 2022 · 4 comments
Labels: bug, feature request, help wanted

angonz commented Oct 20, 2022

I would like to start a discussion about the persistence model of Caddy.
Currently, Caddy's deployment definition includes a generic persistent volume claim, which causes the cluster to provision a generic persistent volume and mount it at the /data directory.

In our first failover tests, we found that, in multi-AZ environments, Caddy fails to reschedule to a different AZ, because a pod cannot bind to a PV located in another AZ (I have a post on Discuss about this).

Now a commit to fix rolling updates in Caddy makes this even harder, as all Caddy pods must run on the same node as the original ReplicaSet. We recently had a site outage when Caddy crashed: it failed to reschedule due to a lack of resources on the original node and was prevented from rescheduling to another node. We had to delete the volume and deployment manually before Caddy could be rescheduled elsewhere.

Additionally, there is an excellent backup plugin that backs up Caddy's data volume. To do this, it declares a node affinity with Caddy so it can access the volume. This is also an issue: if the node does not have enough resources to allocate the backup or restore job pods, those tasks will fail.

The idea behind K8s is to keep nodes tightly dimensioned for their current workloads and let the scheduler assign pods to nodes dynamically wherever there is room. So it is common for a node to run out of resources, forcing crashed pods to be rescheduled onto another node. Too many node affinity constraints and taints limit the scheduler's ability to do this and may leave pods unable to start.

AFAIK, Caddy only uses this volume to store SSL certificates, which are generated dynamically and can be recreated if lost. The other core Open edX pods do not require any PV (apart from MySQL, MongoDB, Elasticsearch, Redis and MinIO, which can be consumed as services outside the K8s cluster).

Concretely, I would like to discuss changing the way Caddy stores this data, in order to improve its scalability and resiliency.

As a starting point, we can review some of these options:

  • Use an emptyDir volume, which is lost when the pod is rescheduled. If I remember correctly, I discussed this with @regisb and he discouraged the idea.
  • Use a generic ephemeral volume. Same as emptyDir, but backed by cluster storage instead of the node's disk; it is likewise deleted together with the pod (see the sketch after this list).
  • Use an NFS volume. Probably the most scalable and stable solution, but more complex to configure and more dependent on the cluster provider. I don't know whether there would be performance issues.
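To make the first two options concrete, here is a minimal sketch of the corresponding volume definitions for Caddy's pod spec; the resource name and size are illustrative, not taken from the actual Tutor templates:

```yaml
# Option 1: emptyDir -- data lives on the node's disk and disappears when the
# pod is rescheduled, so certificates would have to be re-issued on the new node.
volumes:
  - name: data
    emptyDir: {}
```

```yaml
# Option 2: generic ephemeral volume -- a PVC is provisioned from the cluster's
# StorageClass for the pod's lifetime and deleted together with the pod.
volumes:
  - name: data
    ephemeral:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 1Gi
```

In both cases the container would keep mounting the volume at /data; only the lifetime of the data changes.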
regisb commented Oct 21, 2022

Aye aye, I agree that there is an issue and it's important that we fix it. Let me pull Florian @fghaas into the conversation, as he proposed one of the original fixes and also has extensive k8s expertise.

Use an emptyDir volume, which is lost when the pod is rescheduled. If I remember correctly, I discussed this with @regisb and he discouraged the idea.

I don't remember this conversation, but I assume my argument was that we need to preserve SSL certificates or we will be rate-limited by Let's Encrypt's servers. I am also facing this issue when I redeploy the demo Open edX platform 30 times/month.

To be honest, I don't feel very competent to propose a smart solution -- although I do understand the problem. What would you suggest?

regisb added the help wanted, feature request, and bug labels on Oct 21, 2022
fghaas commented Oct 21, 2022

AFAIK, Caddy uses this volume to store only the SSL certificates, which are generated dynamically and can be recreated if lost.

That recreation can take up to 15 minutes in my experience, and that's not counting ACME service disruptions. I would argue that 15 minutes (or longer if the ACME service happens to be unavailable) is not an acceptable service interruption, so I think just relying on automatic cert regeneration in case of a pod being rescheduled to another node is not an option. In other words: we do need the certificate data in a PV.

Use an NFS volume. Probably the most scalable and stable solution, but more complex to configure and more dependent on the cluster provider. I don't know whether there would be performance issues.

More broadly, use a volume with an access mode other than the RWO mode we currently use.

If

  • you want to be able to run Caddy and the backup container from tutor-contrib-backup on different nodes, or
  • you want a RollingUpdate deployment strategy rather than Recreate, and you want the replacement Pod to be able to run on a different node than the original one,

then you'll need an RWX type volume. I'd say that this isn't something you can expect from every Kubernetes provider out there.

So, the only thing I can think of here is to make the Caddy PVC's access mode configurable somehow:

  • If it's ReadWriteOnce (default), set the affinity rules.
  • If it's ReadWriteMany, don't set the affinity rules (a rough sketch follows below).
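For illustration only, here is a rough sketch of what that could look like in the deployment template, assuming a hypothetical CADDY_PVC_ACCESS_MODE setting; the setting name, labels and Jinja structure are mine, not the actual Tutor code:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: caddy
spec:
  accessModes:
    - {{ CADDY_PVC_ACCESS_MODE }}  # hypothetical setting: ReadWriteOnce or ReadWriteMany
  resources:
    requests:
      storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: caddy
spec:
  template:
    spec:
      {% if CADDY_PVC_ACCESS_MODE == "ReadWriteOnce" %}
      # RWO storage: keep every Caddy pod on the node that already holds the volume.
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  app.kubernetes.io/name: caddy
      {% endif %}
      # With RWX storage, no affinity block is rendered and the scheduler is free
      # to place Caddy (and the backup jobs) on any node.
      containers:
        - name: caddy
          image: caddy:2
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: caddy
```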

Is exposing that implementation detail actually useful and beneficial to users?

regisb commented Sep 22, 2023

I have let this issue linger for too long, sorry about that... @wahajArbisoft what's your take on this issue?

BcTpe4HbIu commented

Kubernetes has much more stable and scalable ways to obtain certificates (cert-manager, for example) and to manage ingress traffic (cluster ingress controllers such as traefik). For larger deployments this should be the recommended approach.
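For reference, the cert-manager route usually looks something like the sketch below; the issuer name, hostname and backend service are placeholders, not Tutor configuration:

```yaml
# The ingress controller (e.g. traefik) terminates TLS; cert-manager issues and
# renews the certificate through the referenced ClusterIssuer and stores it in a Secret.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: openedx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod  # placeholder issuer name
spec:
  rules:
    - host: lms.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: lms        # placeholder backend service
                port:
                  number: 8000
  tls:
    - hosts:
        - lms.example.com
      secretName: lms-example-com-tls
```

With this approach no pod needs to keep Let's Encrypt state on a volume at all.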

If the PVC is only used for certificate storage, then setting ENABLE_WEB_PROXY: false should remove the PVC and affinity rules, right?
