
If etcd fails to sync config during initial start sequence and k0s restarts, node creates a new cluster rather than joining existing #5149

Open · emosbaugh opened this issue Oct 23, 2024 · 11 comments · May be fixed by #5151

@emosbaugh (Contributor) commented Oct 23, 2024

Before creating an issue, make sure you've checked the following:

  • You are running the latest released version of k0s
  • Make sure you've searched for existing issues, both open and closed
  • Make sure you've searched for PRs too, a fix might've been merged already
  • You're looking at docs for the released version, "main" branch docs are usually ahead of released versions.

Platform

No response

Version

v1.28.14+k0s.0

Sysinfo

No response

What happened?

When I join many controller nodes in parallel, the Kubernetes API can become unstable for a period. This results in the initial etcd join failing to sync the etcd config, and the k0s process exits.

Oct 18 04:04:39 node-a50a2-04 k0s[3461]: time="2024-10-18 04:04:39" level=info msg="starting Etcd"
Oct 18 04:04:39 node-a50a2-04 k0s[3461]: time="2024-10-18 04:04:39" level=info msg="Starting etcd"
Oct 18 04:04:50 node-a50a2-04 k0s[3461]: Error: failed to start controller node components: failed to sync etcd config: unexpected response status when trying to join etcd cluster: 500 Internal Server Error
Oct 18 04:04:50 node-a50a2-04 systemd[1]: k0scontroller.service: Main process exited, code=exited, status=1/FAILURE
Oct 18 04:04:50 node-a50a2-04 systemd[1]: k0scontroller.service: Failed with result 'exit-code'.
Oct 18 04:04:50 node-a50a2-04 systemd[1]: k0scontroller.service: Consumed 3.172s CPU time.
Oct 18 04:05:00 node-a50a2-04 systemd[1]: k0scontroller.service: Scheduled restart job, restart counter is at 1.
Oct 18 04:05:00 node-a50a2-04 systemd[1]: Stopped k0scontroller.service - k0s - Zero Friction Kubernetes.
Oct 18 04:05:00 node-a50a2-04 systemd[1]: k0scontroller.service: Consumed 3.172s CPU time.
Oct 18 04:05:00 node-a50a2-04 systemd[1]: Started k0scontroller.service - k0s - Zero Friction Kubernetes.

When k0s starts back up, rather than joining the existing cluster, it seems to create a new one.

Oct 18 04:05:02 node-a50a2-04 k0s[3512]: time="2024-10-18 04:05:02" level=info msg="starting Etcd"
Oct 18 04:05:02 node-a50a2-04 k0s[3512]: time="2024-10-18 04:05:02" level=info msg="Starting etcd"
Oct 18 04:05:03 node-a50a2-04 k0s[3512]: time="2024-10-18 04:05:03" level=info msg="Starting to supervise" component=etcd
Oct 18 04:05:03 node-a50a2-04 k0s[3512]: time="2024-10-18 04:05:03" level=info msg="Started successfully, go nuts pid 3537" component=etcd
...
Oct 18 04:05:03 node-a50a2-04 k0s[3512]: time="2024-10-18 04:05:03" level=info msg="{\"level\":\"info\",\"ts\":\"2024-10-18T04:05:03.727336Z\",\"caller\":\"etcdmain/etcd.go:73\",\"msg\":\"Running: \",\"args\":[\"/var/lib/embedded-cluster/k0s/bin/etcd\",\"--tls-min-version=TLS1.2\",\"--data-dir=/var/lib/embedded-cluster/k0s/etcd\",\"--name=node-a50a2-04\",\"--key-file=/var/lib/embedded-cluster/k0s/pki/etcd/server.key\",\"--peer-trusted-ca-file=/var/lib/embedded-cluster/k0s/pki/etcd/ca.crt\",\"--peer-key-file=/var/lib/embedded-cluster/k0s/pki/etcd/peer.key\",\"--peer-cert-file=/var/lib/embedded-cluster/k0s/pki/etcd/peer.crt\",\"--listen-client-urls=https://127.0.0.1:2379\",\"--listen-peer-urls=https://10.0.0.6:2380\",\"--log-level=info\",\"--auth-token=jwt,pub-key=/var/lib/embedded-cluster/k0s/pki/etcd/jwt.pub,priv-key=/var/lib/embedded-cluster/k0s/pki/etcd/jwt.key,sign-method=RS512,ttl=10m\",\"--cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256\",\"--advertise-client-urls=https://127.0.0.1:2379\",\"--client-cert-auth=true\",\"--peer-client-cert-auth=true\",\"--enable-pprof=false\",\"--initial-advertise-peer-urls=https://10.0.0.6:2380\",\"--cert-file=/var/lib/embedded-cluster/k0s/pki/etcd/server.crt\",\"--trusted-ca-file=/var/lib/embedded-cluster/k0s/pki/etcd/ca.crt\"]}" component=etcd stream=stderr
...
Oct 18 04:05:03 node-a50a2-04 k0s[3512]: time="2024-10-18 04:05:03" level=info msg="{\"level\":\"info\",\"ts\":\"2024-10-18T04:05:03.839004Z\",\"caller\":\"etcdserver/server.go:738\",\"msg\":\"started as single-node; fast-forwarding election ticks\",\"local-member-id\":\"40e0ff5ee27c98d0\",\"forward-ticks\":9,\"forward-duration\":\"900ms\",\"election-ticks\":10,\"election-timeout\":\"1s\"}" component=etcd stream=stderr

This seems to be due to a bad assumption in this function. The CA files evidently end up on disk before the etcd join completes, so when k0s restarts after a failed join, the check concludes the node has already joined and skips joining entirely:

```go
// If we've got CA in place we assume the node has already joined previously
func (c *command) needToJoin(nodeConfig *v1beta1.ClusterConfig) bool {
	// Flawed assumption: the CA files can exist on disk even though the
	// etcd join never completed, in which case this returns false and the
	// node bootstraps a fresh single-node cluster instead of joining.
	if file.Exists(filepath.Join(c.K0sVars.CertRootDir, "ca.key")) &&
		file.Exists(filepath.Join(c.K0sVars.CertRootDir, "ca.crt")) {
		return false
	}
	if nodeConfig.Spec.Storage.Type == v1beta1.EtcdStorageType && !nodeConfig.Spec.Storage.Etcd.IsExternalClusterUsed() {
		return !file.Exists(filepath.Join(c.K0sVars.EtcdDataDir, "member", "snap", "db"))
	}
	return true
}
```

Steps to reproduce

  1. Join many controller nodes in parallel

Expected behavior

When joining a node, it should join the existing cluster.

Actual behavior

Joining a node creates a new cluster in some circumstances

Screenshots and logs

k0scontroller-logs.txt

Additional context

No response

@emosbaugh added the bug label Oct 23, 2024
@twz123 (Member) commented Oct 23, 2024

What would be a better way to determine if an existing cluster should be joined? I could imagine that k0s could delete the certs if joining the cluster fails, too...
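If we went the cleanup route, a minimal sketch of what that might look like (the helper name and parameter are made up):

```go
package main

import (
	"os"
	"path/filepath"
)

// cleanupCerts is a hypothetical helper, called when the join fails:
// it removes the CA files so that the next start does not mistake
// their presence for an already-completed join.
func cleanupCerts(certRootDir string) {
	for _, name := range []string{"ca.key", "ca.crt"} {
		_ = os.Remove(filepath.Join(certRootDir, name))
	}
}
```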

@twz123 (Member) commented Oct 23, 2024

Also: k0s could retry 5xx responses in a back-off loop...
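A minimal sketch of such a loop, assuming a hypothetical joinEtcd callback that reports whether a failure (e.g. a 5xx response) is retryable:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// joinWithBackoff retries joinEtcd on retryable errors with exponential
// backoff until it succeeds or the context expires.
func joinWithBackoff(ctx context.Context, joinEtcd func() (retryable bool, err error)) error {
	backoff := time.Second
	const maxBackoff = 30 * time.Second
	for {
		retryable, err := joinEtcd()
		if err == nil {
			return nil
		}
		if !retryable {
			return err
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("giving up on etcd join: %w", err)
		case <-time.After(backoff):
		}
		// Double the delay up to a cap to avoid hammering an unstable API.
		if backoff < maxBackoff {
			backoff *= 2
		}
	}
}
```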

@twz123 (Member) commented Oct 23, 2024

@emosbaugh regarding #5151 and #5149 (comment): Would it make more sense to introduce a special marker file in the k0s data dir that k0s writes as soon as the join process is finished, instead of trying to check several places?
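A sketch of the marker-file idea; the file name "joined" and both helpers are hypothetical:

```go
package main

import (
	"os"
	"path/filepath"
	"time"
)

// joinedMarkerFile is a hypothetical marker inside the k0s data dir.
const joinedMarkerFile = "joined"

// markJoined would be called exactly once, after the join handshake
// has fully completed.
func markJoined(dataDir string) error {
	ts := time.Now().UTC().Format(time.RFC3339) + "\n"
	return os.WriteFile(filepath.Join(dataDir, joinedMarkerFile), []byte(ts), 0600)
}

// hasJoined replaces the CA-file heuristic; deleting the marker file
// forces a rejoin on the next start.
func hasJoined(dataDir string) bool {
	_, err := os.Stat(filepath.Join(dataDir, joinedMarkerFile))
	return err == nil
}
```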

@emosbaugh (Contributor, Author)

> Also: k0s could retry 5xx responses in a back-off loop...

This already happens today, but in this case it eventually gives up:

https://github.com/k0sproject/k0s/blob/main/pkg/component/controller/etcd.go#L103-L115

@emosbaugh (Contributor, Author)

> @emosbaugh regarding #5151 and #5149 (comment): Would it make more sense to introduce a special marker file in the k0s data dir that k0s writes as soon as the join process is finished, instead of trying to check several places?

That makes sense to me. Is there a directory and path that would be appropriate to store this file?

@emosbaugh (Contributor, Author)

> > @emosbaugh regarding #5151 and #5149 (comment): Would it make more sense to introduce a special marker file in the k0s data dir that k0s writes as soon as the join process is finished, instead of trying to check several places?
>
> That makes sense to me. Is there a directory and path that would be appropriate to store this file?

Thinking a bit more about this... I feel like it is better to have a single source of truth, ideally etcd itself. We could use the etcd database file for that as I have it, or perhaps use the result of syncEtcdConfig to detect whether the current node is already a member of the cluster.
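For the etcd-as-source-of-truth option, a sketch using the etcd v3 client's member list; the endpoints and this node's peer URL are assumed to be known, and TLS setup is omitted for brevity:

```go
package main

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// isEtcdMember asks the cluster itself whether peerURL is already a
// registered member, instead of inferring membership from files on disk.
func isEtcdMember(ctx context.Context, endpoints []string, peerURL string) (bool, error) {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
		// TLS config omitted; k0s would use its etcd client certs here.
	})
	if err != nil {
		return false, err
	}
	defer cli.Close()

	resp, err := cli.MemberList(ctx)
	if err != nil {
		return false, err
	}
	for _, m := range resp.Members {
		for _, u := range m.PeerURLs {
			if u == peerURL {
				return true, nil
			}
		}
	}
	return false, nil
}
```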

@jnummelin (Member)

Not sure I grokked this correctly, but I think the problem is not fully on the joining side of the code but also on the join API side of things. What happens is that with one etcd member, you now create join requests for two more in parallel. On the etcd side, once node 1 creates the member entry for node 2, but node 2 has not really joined the cluster yet (its etcd hasn't been started), there's no quorum. And at the same time we do the same for node 3. So we basically bork etcd ourselves.

> I feel like it is better to have a single source of truth, ideally etcd itself.

I agree with this. And reflecting on what I wrote above, I think the join API should actually check the etcd state more closely to see whether we can allow another member to join. When we have one member up, we can only allow one more to fully join (member created and quorum actually reached). Only after that can we allow the next one, and so on.

On the join API side of things we'd probably want to use a suitable HTTP status code to say "I cannot allow you to join at this time as there's no quorum, try again in a bit". Maybe a 503 with a Retry-After header.
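A sketch of what the join API handler could do; etcdHasQuorum is a hypothetical check (e.g. comparing started members against the member-list size):

```go
package main

import (
	"net/http"
)

// etcdHasQuorum is a hypothetical placeholder for a real quorum check.
func etcdHasQuorum() bool { return true }

func handleEtcdJoin(w http.ResponseWriter, r *http.Request) {
	if !etcdHasQuorum() {
		// Tell the joining node to back off and retry, instead of
		// failing hard with a bare 500.
		w.Header().Set("Retry-After", "10")
		http.Error(w, "cannot admit a new member while the cluster lacks quorum, retry later",
			http.StatusServiceUnavailable)
		return
	}
	// ... proceed with the normal member-add flow ...
}
```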

@twz123 (Member) commented Oct 30, 2024

> Thinking a bit more about this... I feel like it is better to have a single source of truth, ideally etcd itself. We could use the etcd database file for that as I have it, or perhaps use the result of syncEtcdConfig to detect whether the current node is already a member of the cluster.

I kinda like the marker file because it's a) super-dumb to implement, b) super-easy to delete in case somebody wants to force a rejoin, and c) backend-agnostic. It would also work with, say, kine/NATS types of setups.

@twz123 (Member) commented Oct 30, 2024

> On the join API side of things we'd probably want to use a suitable HTTP status code to say "I cannot allow you to join at this time as there's no quorum, try again in a bit". Maybe a 503 with a Retry-After header.

Agree: #5149 (comment)

@emosbaugh (Contributor, Author) commented Oct 30, 2024

I agree that the issue is that the API is unstable when many nodes are joining and there is no quorum. Eventually the API becomes stable in my scenario and the node is able to join; it just does not wait long enough.

When it does give up and restart, it incorrectly checks the PKI certs to decide whether it has already joined, which it has not. When it sees that they exist and determines that it does not need to join, it starts a new cluster rather than joining the existing one (joinClient is nil). If you run kubectl get node on the node that started its own cluster, it shows as the only node in a healthy single-node cluster.

Are you suggesting that we continue to retry forever until the join is successful? What if, in other scenarios, the API never becomes healthy? What if the process restarts for another reason? It will still exhibit the same behavior and create a new cluster. Therefore, in my opinion, the real issue here is not one of backoff/retry but the check for "do I need to join?".

@jnummelin (Member)

> Are you suggesting that we continue to retry forever until the join is successful?

I don't think we want to wait forever, but certainly longer than we currently do.

> Therefore, in my opinion, the real issue here is not one of backoff/retry but the check for "do I need to join?".

Absolutely, that is part of the problem, and maybe the most important part. I'm just pointing out that the API side has some issues too, which we also want to address. By fixing both sides we make this much more robust.
