**Is your feature request related to a problem? Please describe.**
Some k0s configuration options are mutually exclusive; others cannot be changed after cluster creation. Currently, there is minimal sanity checking for dynamic configuration. As a result, unsupported configuration combinations can break a cluster completely. Debugging such a problem is hard, as the helpful error messages are only emitted during cluster creation. See #4721 for an example.
**Describe the solution you would like**

Prevent invalid configurations from being stored in the cluster:

1. Add CEL (Common Expression Language) validation rule markers to the various `ClusterConfig` structs. This allows the validation to be performed by the API server itself, preventing invalid configurations from reaching the cluster in the first place.
2. Handle unsupported configuration values gracefully. Depending on how effective point 1 turns out to be, there are several options (see the sketches after this list):
   - If the configuration validation fails, k0s doesn't reconcile the configuration at all, and the components remain on the last valid configuration. This is straightforward to implement. While tempting, it could undermine the reconciliation of other valid and safe parts of the configuration, and is therefore only a good choice if point 1 is effective and invalid configurations stored in a cluster can be considered a pathological edge case.
   - Try to get as close to the desired configuration as possible without breaking the cluster. This might involve "resolving" an invalid desired configuration, by comparing it to the last valid one, into a "fixed" target configuration that passes validation and can be safely reconciled. This approach is more elaborate and error-prone, and may not be necessary if point 1 proves effective.
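For point 1, controller-gen's `+kubebuilder:validation:XValidation` markers embed CEL rules into the generated CRD schema, so the API server rejects invalid objects at admission time. The sketch below shows the idea on a simplified network spec; the fields and rules are illustrative, not the actual k0s `ClusterConfig` schema:

```go
// Illustrative sketch only; these fields and rules are not k0s's real schema.

// NetworkSpec selects and configures the CNI provider. The struct-level rule
// expresses mutual exclusion: Calico settings may only be set when the
// matching provider is selected.
// +kubebuilder:validation:XValidation:rule="!has(self.calico) || self.provider == 'calico'",message="calico settings require provider 'calico'"
type NetworkSpec struct {
	// Provider cannot be changed after cluster creation; the transition rule
	// compares the incoming value against the stored one on every update.
	// +kubebuilder:validation:Enum=kuberouter;calico;custom
	// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="provider is immutable"
	Provider string `json:"provider,omitempty"`

	// Calico holds Calico-specific settings.
	// +optional
	Calico *CalicoSpec `json:"calico,omitempty"`
}
```

Transition rules (those referencing `oldSelf`) only run on updates, which covers exactly the "cannot be changed after cluster creation" case.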
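And a minimal sketch of the first option under point 2, assuming a hypothetical reconciler that remembers the last configuration that passed validation (the type, field names, and `Validate` signature are assumptions, not k0s's actual plumbing):

```go
package config

import (
	"errors"
	"fmt"

	"github.com/k0sproject/k0s/pkg/apis/k0s/v1beta1"
)

// configReconciler is a hypothetical wrapper around the component plumbing.
type configReconciler struct {
	lastValid         *v1beta1.ClusterConfig
	applyToComponents func(*v1beta1.ClusterConfig) error
}

// reconcile applies the first option: on validation failure, keep the
// components on the last configuration that passed validation.
func (r *configReconciler) reconcile(desired *v1beta1.ClusterConfig) error {
	if errs := desired.Validate(); len(errs) > 0 {
		// Leave r.lastValid untouched; components keep running on it.
		return fmt.Errorf("refusing to reconcile invalid configuration: %w", errors.Join(errs...))
	}
	if err := r.applyToComponents(desired); err != nil {
		return err
	}
	r.lastValid = desired
	return nil
}
```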
**Describe alternatives you've considered**

- A validating webhook is a more powerful approach to preventing invalid configurations from being stored in the cluster. At the same time, it's much heavier and more complex, and currently doesn't add significant value over the simpler CEL approach.
- "Let it crash": terminate the process and wait for a restart. This is not suitable, as it would bring down the local API server, making it difficult to fix invalid configurations, and could also harm the entire etcd cluster in an HA scenario.
**Additional context**

A particular challenge is the stack applier, which almost all k0s components use to manage resources in the cluster. Currently, there is no good way to suspend a stack to prevent it from being applied, which may become necessary to keep a known-bad configuration from being rolled out. Suspending stack reconciliation could be achieved by suspending leader election globally, but that would also prevent the partial reconciliation discussed above; a per-stack mechanism, as sketched below, would be more targeted.
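A rough sketch of what a per-stack suspension flag could look like; the `StackApplier` shape here is hypothetical and only meant to show that suspension can be scoped to a single stack rather than to leader election as a whole:

```go
package applier

import (
	"context"
	"sync"
)

// stack is a stand-in for however a stack of manifests is represented internally.
type stack interface {
	Apply(ctx context.Context) error
}

// StackApplier is a hypothetical, suspendable applier for a single stack.
type StackApplier struct {
	mu        sync.Mutex
	suspended bool
	stack     stack
}

// Suspend skips future apply passes for this stack only, without touching
// leader election, so other stacks can still be reconciled.
func (s *StackApplier) Suspend() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.suspended = true
}

// Resume re-enables apply passes.
func (s *StackApplier) Resume() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.suspended = false
}

// applyOnce runs a single apply pass, doing nothing while suspended.
func (s *StackApplier) applyOnce(ctx context.Context) error {
	s.mu.Lock()
	skip := s.suspended
	s.mu.Unlock()
	if skip {
		return nil
	}
	return s.stack.Apply(ctx)
}
```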