**Is your feature request related to a problem? Please describe.**
Some k0s configuration options are mutually exclusive; others cannot be changed after cluster creation. Currently, there is minimal sanity checking for dynamic configuration. As a result, unsupported configuration combinations can break a cluster completely. Debugging such a problem is hard, as the helpful error messages are only emitted during cluster creation. See #4721 for an example.
**Describe the solution you would like**

Prevent invalid configurations from being stored in the cluster:

1. Add CEL (Common Expression Language) validation rule markers to the various `ClusterConfig` structs. This allows the validation to be performed by the API server itself, preventing invalid configurations from reaching the cluster in the first place.
2. Handle unsupported configuration values gracefully. Depending on how effective point 1 turns out to be, there are several options (see the sketches after this list):
   - If the configuration validation fails, k0s doesn't reconcile the configuration at all, and the components remain on the last valid configuration. This is straightforward to implement. While tempting, it could undermine the reconciliation of other valid and safe parts of the configuration, and is therefore only a good choice if point 1 is effective and invalid configurations stored in a cluster can be considered a pathological edge case.
   - Try to get as close to the desired configuration as possible without breaking the cluster. This might involve "resolving" an invalid desired configuration, by comparing it to the last valid one, into a "fixed" target configuration that passes validation and can be safely reconciled. This approach is more elaborate and error-prone, and may not be necessary if point 1 proves effective.
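For point 1, controller-gen's `+kubebuilder:validation:XValidation` markers embed CEL rules into the generated CRD schema, so the API server rejects invalid objects at admission time. The sketch below shows the idea on a simplified network spec; the fields and rules are illustrative, not the actual k0s `ClusterConfig` schema:

```go
// Illustrative sketch only; these fields and rules are not k0s's real schema.

// NetworkSpec selects and configures the CNI provider. The struct-level rule
// expresses mutual exclusion: Calico settings may only be set when the
// matching provider is selected.
// +kubebuilder:validation:XValidation:rule="!has(self.calico) || self.provider == 'calico'",message="calico settings require provider 'calico'"
type NetworkSpec struct {
	// Provider cannot be changed after cluster creation; the transition rule
	// compares the incoming value against the stored one on every update.
	// +kubebuilder:validation:Enum=kuberouter;calico;custom
	// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="provider is immutable"
	Provider string `json:"provider,omitempty"`

	// Calico holds Calico-specific settings.
	// +optional
	Calico *CalicoSpec `json:"calico,omitempty"`
}
```

Transition rules (those referencing `oldSelf`) only run on updates, which covers exactly the "cannot be changed after cluster creation" case.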
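And a minimal sketch of the first option under point 2, assuming a hypothetical reconciler that remembers the last configuration that passed validation (the type, field names, and `Validate` signature are assumptions, not k0s's actual plumbing):

```go
package config

import (
	"errors"
	"fmt"

	"github.com/k0sproject/k0s/pkg/apis/k0s/v1beta1"
)

// configReconciler is a hypothetical wrapper around the component plumbing.
type configReconciler struct {
	lastValid         *v1beta1.ClusterConfig
	applyToComponents func(*v1beta1.ClusterConfig) error
}

// reconcile applies the first option: on validation failure, keep the
// components on the last configuration that passed validation.
func (r *configReconciler) reconcile(desired *v1beta1.ClusterConfig) error {
	if errs := desired.Validate(); len(errs) > 0 {
		// Leave r.lastValid untouched; components keep running on it.
		return fmt.Errorf("refusing to reconcile invalid configuration: %w", errors.Join(errs...))
	}
	if err := r.applyToComponents(desired); err != nil {
		return err
	}
	r.lastValid = desired
	return nil
}
```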
**Describe alternatives you've considered**

- A validating webhook is a more powerful approach to preventing invalid configurations from being stored in the cluster. At the same time, it's much heavier and more complex, and currently doesn't add significant value over the simpler CEL approach.
- "Let it crash": terminate the process and wait for a restart. This is not suitable, as it would bring down the local API server, making it difficult to fix invalid configurations, and could also harm the entire etcd cluster in an HA scenario.
**Additional context**

A particular challenge is the stack applier, which almost all k0s components use to manage resources in the cluster. Currently, there is no good way to suspend a stack to prevent it from being applied, which may become necessary to keep a known-bad configuration from being rolled out. Suspending stack reconciliation could be achieved by suspending leader election globally, but that would also prevent the partial reconciliation discussed above; a per-stack mechanism, as sketched below, would be more targeted.
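A rough sketch of what a per-stack suspension flag could look like; the `StackApplier` shape here is hypothetical and only meant to show that suspension can be scoped to a single stack rather than to leader election as a whole:

```go
package applier

import (
	"context"
	"sync"
)

// stack is a stand-in for however a stack of manifests is represented internally.
type stack interface {
	Apply(ctx context.Context) error
}

// StackApplier is a hypothetical, suspendable applier for a single stack.
type StackApplier struct {
	mu        sync.Mutex
	suspended bool
	stack     stack
}

// Suspend skips future apply passes for this stack only, without touching
// leader election, so other stacks can still be reconciled.
func (s *StackApplier) Suspend() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.suspended = true
}

// Resume re-enables apply passes.
func (s *StackApplier) Resume() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.suspended = false
}

// applyOnce runs a single apply pass, doing nothing while suspended.
func (s *StackApplier) applyOnce(ctx context.Context) error {
	s.mu.Lock()
	skip := s.suspended
	s.mu.Unlock()
	if skip {
		return nil
	}
	return s.stack.Apply(ctx)
}
```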