
K0S run via docker-compose doesn't recover from host rebooting (single host) #5023

Open
4 tasks done
tmeltser opened this issue Sep 22, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@tmeltser

tmeltser commented Sep 22, 2024

Before creating an issue, make sure you've checked the following:

  • You are running the latest released version of k0s
  • Make sure you've searched for existing issues, both open and closed
  • Make sure you've searched for PRs too, a fix might've been merged already
  • You're looking at docs for the released version, "main" branch docs are usually ahead of released versions.

Platform

as360@AS360-AIO-Ubuntu:~$ uname -srvmo; cat /etc/os-release || lsb_release -a
Linux 6.8.0-45-generic #45-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug 30 12:02:04 UTC 2024 x86_64 GNU/Linux
PRETTY_NAME="Ubuntu 24.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.1 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo

Version

v1.30.4+k0s.0

Sysinfo

`k0s sysinfo`
Total memory: 35.2 GiB (pass)
Disk space available for /var/lib/k0s: 197.0 GiB (pass)
Name resolution: localhost: [::1 127.0.0.1] (pass)
Operating system: Linux (pass)
  Linux kernel release: 6.8.0-45-generic (pass)
  Max. file descriptors per process: current: 1048576 / max: 1048576 (pass)
  AppArmor: unavailable (pass)
  Executable in PATH: modprobe: /sbin/modprobe (pass)
  Executable in PATH: mount: /bin/mount (pass)
  Executable in PATH: umount: /bin/umount (pass)
  /proc file system: mounted (0x9fa0) (pass)
  Control Groups: version 2 (pass)
    cgroup controller "cpu": available (is a listed root controller) (pass)
    cgroup controller "cpuacct": available (via cpu in version 2) (pass)
    cgroup controller "cpuset": available (is a listed root controller) (pass)
    cgroup controller "memory": available (is a listed root controller) (pass)
    cgroup controller "devices": available (device filters attachable) (pass)
    cgroup controller "freezer": available (cgroup.freeze exists) (pass)
    cgroup controller "pids": available (is a listed root controller) (pass)
    cgroup controller "hugetlb": available (is a listed root controller) (pass)
    cgroup controller "blkio": available (via io in version 2) (pass)
  CONFIG_CGROUPS: Control Group support: no kernel config found (warning)
  CONFIG_NAMESPACES: Namespaces support: no kernel config found (warning)
  CONFIG_NET: Networking support: no kernel config found (warning)
  CONFIG_EXT4_FS: The Extended 4 (ext4) filesystem: no kernel config found (warning)
  CONFIG_PROC_FS: /proc file system support: no kernel config found (warning)

What happened?

k0s running on a single-node system (multiple services run via docker-compose) doesn't reliably survive host reboots: sometimes it comes back up for a few reboots and then stops recovering, and sometimes it fails to come up after the very first reboot.
Attached below is a sample docker-compose file that demonstrates the problem.
Tried on Ubuntu 24.04 and CentOS 9 with the same results.

Steps to reproduce

  1. Take the sample docker-compose file (attached below)
  2. Run the following command: docker compose -f aio-compose-sample.yaml up -d --wait
  3. Reboot the host several times. At some point, after a handful of restarts (or sometimes even after the first restart), k0s breaks
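
For reference, a compose file along these lines might look roughly as follows. This is an illustrative sketch, not the attached aio-compose-sample.yaml; the image tag, privileges, and volume follow the general pattern for running k0s in Docker, and any other services from the actual setup are omitted:

```yaml
# Hypothetical minimal sketch of a k0s-in-Docker compose service.
# The attached aio-compose-sample.yaml may differ in details.
services:
  k0s:
    image: docker.io/k0sproject/k0s:v1.30.4-k0s.0
    command: k0s controller --enable-worker
    hostname: k0s
    privileged: true            # k0s-in-Docker needs extended privileges
    restart: unless-stopped     # restart the container after a host reboot
    volumes:
      - /var/lib/k0s            # anonymous volume persisting cluster state
    ports:
      - "6443:6443"             # kube-apiserver
```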

Expected behavior

k0s should always survive host restarts.

Actual behavior

After a few restarts, k0s breaks down:

# docker compose -f aio-compose-sample.yaml exec k0s k0s kubectl get pods -A
NAMESPACE       NAME                                           READY   STATUS        RESTARTS       AGE
cert-manager    cert-manager-9647b459d-hlxr2                   1/1     Running       1 (4h3m ago)   4h15m
cert-manager    cert-manager-cainjector-5d8798687c-h8lk4       1/1     Running       2 (4h3m ago)   4h15m
cert-manager    cert-manager-webhook-c77744d75-b5vcn           1/1     Running       1 (4h3m ago)   4h15m
ingress-nginx   ingress-nginx-admission-create-bxxgb           0/1     Pending       0              17m
ingress-nginx   ingress-nginx-admission-create-lh4p7           0/1     Terminating   0              3h58m
ingress-nginx   ingress-nginx-controller-55df698df5-6vtxj      1/1     Running       1 (4h3m ago)   4h16m
k0s-system      k0s-pushgateway-86bd768578-cp7cq               1/1     Running       1 (4h3m ago)   4h17m
kube-system     coredns-85c69f454c-2hgn7                       1/1     Running       1 (4h3m ago)   4h17m
kube-system     konnectivity-agent-27m8k                       1/1     Terminating   1 (4h3m ago)   4h17m
kube-system     kube-proxy-pxnl5                               1/1     Running       1 (4h3m ago)   4h17m
kube-system     kube-router-84vsr                              1/1     Terminating   1 (4h3m ago)   4h17m
kube-system     metrics-server-7cc78958fc-gkj7l                1/1     Running       1 (4h3m ago)   4h17m
openebs         openebs-localpv-provisioner-86d8949887-49rr7   1/1     Running       0              4h2m
openebs         openebs-pre-upgrade-hook-6jcts                 0/1     Pending       0              3h38m
# docker compose -f aio-compose-sample.yaml exec k0s k0s kubectl get nodes
NAME   STATUS     ROLES           AGE     VERSION
k0s    NotReady   control-plane   4h18m   v1.30.4+k0s

Screenshots and logs

Kindly advise what logs are needed, and I'll be happy to add them.

Additional context

Adding a sample docker compose to demonstrate the problem:
aio-compose-sample.zip

k0s status and Docker version info:

as360@AS360-AIO-Ubuntu:~$ docker compose -f aio-compose-sample.yaml exec k0s k0s status
Version: v1.30.4+k0s.0
Process ID: 7
Role: controller
Workloads: true
SingleNode: false
Kube-api probing successful: true
Kube-api probing last error:

as360@AS360-AIO-Ubuntu:~$ docker version
Client: Docker Engine - Community
 Version:           27.3.1
 API version:       1.47
 Go version:        go1.22.7
 Git commit:        ce12230
 Built:             Fri Sep 20 11:40:59 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          27.3.1
  API version:      1.47 (minimum version 1.24)
  Go version:       go1.22.7
  Git commit:       41ca978
  Built:            Fri Sep 20 11:40:59 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.22
  GitCommit:        7f7fdf5fed64eb6a7caf99b3e12efcf9d60e311c
 runc:
  Version:          1.1.14
  GitCommit:        v1.1.14-0-g2c9f560
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
@tmeltser tmeltser added the bug Something isn't working label Sep 22, 2024
@juanluisvaladas juanluisvaladas self-assigned this Sep 23, 2024
@twz123
Member

twz123 commented Sep 27, 2024

> Kindly advise what logs are needed, and I'll be happy to add them.

The logs of the k0s Docker container would be helpful, as would the logs of the failing containers. You could also add the --debug flag to k0s for even more detailed logs. You might want to try to collect a support bundle as well.

@tmeltser
Author

tmeltser commented Sep 29, 2024

Hi,
PFA the K0S (debug flag turned on) log file: k0s.log
As for the logs of the failed containers, kubectl doesn't seem to be able to supply them in this state (Ubuntu host, after reboot):

as360@AS360-AIO-Ubuntu:~$ docker compose -f aio-compose-sample.yaml exec k0s k0s kubectl logs konnectivity-agent-j6llm -n kube-system
Error from server: Get "https://172.17.0.2:10250/containerLogs/kube-system/konnectivity-agent-j6llm/konnectivity-agent": No agent available
as360@AS360-AIO-Ubuntu:~$ docker compose -f aio-compose-sample.yaml exec k0s k0s kubectl logs ingress-nginx-admission-create-9x6tb -n ingress-nginx
Error from server: Get "https://172.17.0.2:10250/containerLogs/ingress-nginx/ingress-nginx-admission-create-9x6tb/create": No agent available

@twz123
Member

twz123 commented Oct 8, 2024

It could be that there are some stuck containers from previous runs. When shutting down, k0s won't stop running pods/containers; you need to drain the node manually. Moreover, when running k0s in Docker, the cgroup hierarchy is possibly not properly respected, and container processes (or at least their cgroup hierarchies) might keep running. I can imagine that this causes some trouble. Could you try adding volumes for /opt/cni and /etc/cni/net.d to your compose config? After looking at the logs, I assume that some old kube-router container is blocking a new one, but containerd can't remove the old one properly, because after the restart the CNI plugins are no longer installed.
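
Concretely, the suggestion amounts to something like this in the compose file (untested sketch; the service name and any existing settings are taken from the assumed setup above):

```yaml
# Sketch: persist CNI plugin binaries and configs across container restarts.
services:
  k0s:
    # ...existing image/command/privileged settings...
    volumes:
      - /var/lib/k0s    # cluster state (if not already persisted)
      - /opt/cni        # CNI plugin binaries
      - /etc/cni/net.d  # CNI network configurations
```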

@tmeltser
Author

tmeltser commented Oct 8, 2024

I can't drain the node manually, since we are talking about an unexpected machine restart/reboot.
As for the requested volumes, no problem, I'll do that and report back (I assume we are talking about anonymous volumes and not host mounts, right?).
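
To illustrate the two styles I mean (illustrative snippet; the host-side paths are hypothetical):

```yaml
services:
  k0s:
    volumes:
      # anonymous volumes (Docker-managed storage):
      - /opt/cni
      - /etc/cni/net.d
      # ...as opposed to host bind mounts, which would look like:
      # - /srv/k0s/opt-cni:/opt/cni
      # - /srv/k0s/cni-net.d:/etc/cni/net.d
```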

@tmeltser
Author

tmeltser commented Oct 9, 2024

I've added the two new (anonymous) volumes, and it didn't make any difference (k0s failed to come up after the first restart). Attaching the updated sample compose file:
aio-compose-sample.zip
Attaching K0S log file:
k0s.log
Any other suggestions to resolve this problem?

@tmeltser
Author

Any advice on the subject would be very much appreciated...
