talos.inline.config not working in Omni generated VMWare ova image #9503

Open
sedh-sbab opened this issue Oct 15, 2024 · 11 comments

Comments

@sedh-sbab

Bug Report

When I generate a Talos v1.8.1 image for the VMware platform using omnictl, our CA certificate config is ignored when it is passed via the talos.config.inline kernel argument. The console log of the Talos VM shows x509 certificate errors when it tries to connect to our Omni instance.

I provide --extra-kernel-args talos.config.inline=${TALOS_CONFIG_INLINE} to omnictl, where $TALOS_CONFIG_INLINE is created by following the guide here: https://www.talos.dev/v1.8/reference/kernel/#talosconfiginline. The config document is a CA certificate, see https://www.talos.dev/v1.8/talos-guides/configuration/certificate-authorities/#appending-the-certificate-authority. I have tried using the official factory.talos.dev with the same result. I have checked the GRUB menu, and the talos.config.inline key and value are present.

If I instead provide the same CA certificate config document as a base64-encoded string through the VMware guestinfo, the CA certificate works great and the node can connect to our Omni instance without any errors. I use this command to insert the config document into the VM:

govc vm.change \
  -e "guestinfo.talos.config=$(cat ca-root-config.yml | base64)"
....
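
A minimal sketch of the encoding step in the command above, in case line wrapping in the base64 output matters for the guestinfo value. It prints the payload on a single line and checks that it decodes back to the original file. The helper names encode_config and verify_roundtrip are made up for illustration and are not part of govc or Talos; only base64, tr and cmp are assumed to be installed.

```shell
# Hypothetical helpers, not part of govc or Talos.

# Print the file as base64 on a single line (wrapping newlines stripped).
encode_config() {
  base64 < "$1" | tr -d '\n'
}

# Succeed only if the encoded payload decodes back to the original file.
verify_roundtrip() {
  encoded=$(encode_config "$1") || return 1
  printf '%s' "$encoded" | base64 -d | cmp -s - "$1"
}
```

With these defined, the argument could be written as -e "guestinfo.talos.config=$(encode_config ca-root-config.yml)", after verify_roundtrip ca-root-config.yml has succeeded.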

I have also tried wiping and resetting the machine, and editing the kernel arguments to change the platform and remove the talos.config=guestinfo argument, without any luck. I am not sure that has anything to do with this, though.

Platform: VMWare (OVA template)
Talos Version: v1.8.1

@smira
Member

smira commented Oct 15, 2024

Please provide kernel logs.

P.S. It's way better to use userdata than talos.config.inline with Omni.

@sedh-sbab
Author

The kernel log: the best I can do is an image, hope that works
image
The rest of the logs are mostly from time.syncController, which can't connect out to the internet.

The status of the node stays like this forever:
image

P.S. Using userdata actually sounds very reasonable, since it doesn't have quite the same limitations. Thank you.

@smira
Member

smira commented Oct 15, 2024

We need full kernel logs (serial console logs) to understand why the config failed to load. We can't debug much without them, sorry.

@sedh-sbab
Author

> We need full kernel logs (serial console logs) to understand why the config failed to load. We can't debug much without them, sorry.

I'll check whether I can attach a serial port and save the output to disk.

@sedh-sbab
Author

Here it is! I have redacted the sensitive information.
console-log.txt

On lines 15 and 104, talos.config.inline is clearly missing. I can see it in the GRUB menu, though.
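
A minimal sketch for checking a captured kernel command line for a given argument, rather than scanning it by eye. has_karg is a hypothetical helper, not a Talos or GRUB tool, and it assumes the command line has been copied out of the console log into a string.

```shell
# Hypothetical helper: succeed if kernel argument KEY (bare, or KEY=value)
# appears as a whole argument in the command line string CMDLINE.
# The body runs in a subshell so that "set -f" (no globbing while the
# string is split on whitespace) does not leak into the caller.
has_karg() (
  key=$1
  cmdline=$2
  set -f
  for arg in $cmdline; do
    case $arg in
      "$key"|"$key"=*) exit 0 ;;
    esac
  done
  exit 1
)
```

For example, has_karg talos.config.inline "$line" would fail for a command line that only carries talos.config=guestinfo, because the match is on the whole key and not on a prefix.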

@smira
Member

smira commented Oct 15, 2024

If you're booting from the OVA, it should be there, unless there was something else happening (like an upgrade) which would wipe that kernel argument?

@sedh-sbab
Author

sedh-sbab commented Oct 15, 2024

It's very strange; I have to do some more digging. No upgrade or other adjustments were made, and the kernel arguments are clearly visible in the GRUB edit menu. The steps are:

  1. Generate with omnictl
  2. Upload to our content directory with govc
  3. Deploy it. (I make no adjustments or modifications in this step, simply New VM from template)
  4. Start

These console logs are of a completely fresh machine I created.

Here is the full omnictl command with expanded variables:

omnictl download vmware \
      --talos-version v1.8.1 \
      --arch amd64 \
      --extensions vmtoolsd-guest-agent \
      --initial-labels environment=<env> --initial-labels region=<REGION> \
      --extra-kernel-args talos.config.inline=$(cat sbab-root-ca.yml | zstd --compress --ultra -22 | base64 -w 0) \
      --output _out/v1.8.1-<REGION>-common

GRUB image:
image

@smira
Member

smira commented Oct 15, 2024

I wonder if it's too big and gets cut by GRUB... maybe your certificate is RSA? ECDSA is way smaller

@sedh-sbab
Author

sedh-sbab commented Oct 16, 2024

Well, in total, with our talos.config.inline, the whole command line is 2700 bytes.

BOOT_IMAGE=/A/vmlinuz talos.platform=vmware talos.config=guestinfo console=tty0 console=ttyS0 earlyprintk=ttyS0,115200 net.ifnames=0 init_on_alloc=1 slab_nomerge pti=on consoleblank=0 nvme_core.io_timeout=4294967295 printk.devkmsg=on ima_template=ima-ng ima_appraise=fix ima_hash=sha512 siderolink.api=https://<REDACTED>:443?grpc_tunnel=false&jointoken=<REDACTED> talos.events.sink=[fdae:41e4:649b:9303::1]:8091 talos.logging.kernel=tcp://[fdae:41e4:649b:9303::1]:8092 talos.config.inline=<2213 bytes>

This is without the redacted stuff.

❯ wc -c talos-kernel-args.txt
2700 talos-kernel-args.txt

Your documentation says the Linux kernel arguments have a maximum size of 4096, but maybe GRUB has another limit?
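
A minimal sketch for measuring a command line against a candidate limit. check_cmdline_len is a hypothetical helper; its default of 4096 is only the figure quoted from the documentation, and the actual limit that applies to GRUB in BIOS mode is not established in this thread, so the limit is a parameter to experiment with.

```shell
# Hypothetical helper: print the byte length of a kernel command line and
# fail if it exceeds LIMIT. LIMIT defaults to 4096 (the documented figure);
# any other candidate limit is an assumption to be tested.
check_cmdline_len() {
  limit=${2:-4096}
  len=$(printf '%s' "$1" | wc -c | tr -d ' ')
  echo "$len bytes (limit $limit)"
  [ "$len" -le "$limit" ]
}
```

Running check_cmdline_len "$(cat talos-kernel-args.txt)" 2048, for instance, would show whether the arguments fit under a smaller hypothetical limit. Note that the command substitution drops the file's trailing newline, so the count can be one lower than wc -c on the file.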

@smira
Member

smira commented Oct 16, 2024

yes, it might be a limit in GRUB or in the boot protocol used with GRUB (I guess you're booting in BIOS mode on VMWare?)

@sedh-sbab
Author

> yes, it might be a limit in GRUB or in the boot protocol used with GRUB (I guess you're booting in BIOS mode on VMWare?)

Yes, BIOS.


2 participants