
[WIP] Feature: watchdog-fencing: drop auto-calculation of stonith-watchdog-timeout #3677

Open · wants to merge 1 commit into base: main
Conversation

wenningerk
Contributor


The implementation wasn't safe: it didn't prevent a mismatch between stonith-watchdog-timeout and SBD_WATCHDOG_TIMEOUT on certain nodes when SBD_WATCHDOG_TIMEOUT wasn't configured to the same value on all nodes.
Now it is also possible to validate the value as a valid interval spec.
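
For context, a minimal sketch (not the actual Pacemaker implementation) of the dropped auto-calculation, assuming the documented semantics that a negative stonith-watchdog-timeout meant "use twice the local SBD_WATCHDOG_TIMEOUT". Because each node only looks at its own environment, nodes with different SBD_WATCHDOG_TIMEOUT values silently end up with different effective timeouts:

```c
/* Illustration only; not Pacemaker's actual code. */
#include <stdio.h>
#include <stdlib.h>

/* Effective timeout (seconds) a node would use for watchdog fencing. */
static long
effective_swt_s(long configured_swt_s)
{
    if (configured_swt_s < 0) {
        /* Auto-calculate from the *local* environment only */
        const char *env = getenv("SBD_WATCHDOG_TIMEOUT");
        long sbd_wd_s = (env != NULL)? strtol(env, NULL, 10) : 5; /* 5s fallback for this sketch */

        return 2 * sbd_wd_s;
    }
    return configured_swt_s;
}

int
main(void)
{
    /* With stonith-watchdog-timeout=-1, a node with SBD_WATCHDOG_TIMEOUT=5
     * waits 10s while a node with SBD_WATCHDOG_TIMEOUT=30 waits 60s, so the
     * survivor may consider the target dead before its watchdog has fired. */
    printf("effective stonith-watchdog-timeout: %lds\n", effective_swt_s(-1));
    return 0;
}
```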

@wenningerk
Contributor Author

This is WIP. The code is looking good at a first glance, but I haven't done any tests with it, so maybe don't invest your time yet. I would like to complete the documentation work. For now I just removed the reference to the auto-calculation. Do you think we need to add that pacemaker will of course start on nodes where sbd isn't active if that is configured via fence_watchdog? Or is it clear enough from what we have?
@gao-yan, can you please check which part of the Chinese text I would have to remove/modify? It would be cool if you could give me the line numbers - if that is enough.

Contributor

@kgaillot left a comment

looking good

Review comments (outdated, resolved) on:
  • doc/sphinx/Pacemaker_Explained/cluster-options.rst
  • lib/common/options.c
  • lib/common/watchdog.c
  • po/zh_CN.po
@kgaillot
Contributor

kgaillot commented Oct 2, 2024

Do you think we need to add that pacemaker will of course start on nodes where sbd isn't active if that is configured via fence_watchdog? Or is it clear enough from what we have?

I think anyone using fence_watchdog will be following some documentation for it. That would be good to add to the man page though.

Member

@gao-yan left a comment

Would it make sense to exchange the watchdog timeouts between the nodes and still do the automatic calculation but based on the maximum?

@kgaillot
Contributor

kgaillot commented Oct 3, 2024

Would it make sense to exchange the watchdog timeouts between the nodes and still do the automatic calculation but based on the maximum?

If there's a partition (especially for example at start-up), it's the value on the nodes you can't reach that would be relevant

@gao-yan
Member

gao-yan commented Oct 4, 2024

Would it make sense to exchange the watchdog timeouts between the nodes and still do the automatic calculation but based on the maximum?

If there's a partition (especially for example at start-up), it's the value on the nodes you can't reach that would be relevant

If their watchdog timeouts are unknown, they must have been partitioned since the beginning. In that case, they won't self-reset anyway but OTOH they won't host resources if they are not quorate.

@wenningerk force-pushed the validate_watchdog_timeout branch 2 times, most recently from d9fd8a5 to 5bb16c5 on October 7, 2024 at 18:33
@wenningerk
Contributor Author

Being a bit more verbose now in the fence_watchdog man page, mentioning that it can be used to limit watchdog fencing to certain nodes and to skip the daemon & watchdog consistency check on the others.

@kgaillot
Contributor

kgaillot commented Oct 7, 2024

Would it make sense to exchange the watchdog timeouts between the nodes and still do the automatic calculation but based on the maximum?

If there's a partition (especially for example at start-up), it's the value on the nodes you can't reach that would be relevant

If their watchdog timeouts are unknown, they must have been partitioned since the beginning. In that case, they won't self-reset anyway but OTOH they won't host resources if they are not quorate.

Good points. I'm still wary of autocalculation though. Too much opportunity for timing issues if nodes are joining and leaving. Plus there's the added complication of the fencer needing the value but the controller having the only access to remote nodes. @wenningerk , what do you think?

@wenningerk
Contributor Author

Good points. I'm still wary of autocalculation though. Too much opportunity for timing issues if nodes are joining and leaving. Plus there's the added complication of the fencer needing the value but the controller having the only access to remote nodes. @wenningerk , what do you think?

The setting of the hardware watchdog timeout on the nodes isn't anything that changes dynamically. And if it does change, it is probably unintentional, and we have our safety belt for that.
Of course we could collect the timeouts from the devices that show up with the quorate partition and move the effective stonith-watchdog-timeout up if it no longer matches. But as Ken says, we would have to pass the info around between daemons. Races where the info hasn't spread yet while a resource started by pacemaker is already up and there is already a reason to fence are maybe not that likely. With resources we find already running during our initial probes, things might become more critical.
Why don't we think of passing the issue to the high-level tooling? It could happen either with the setup of SBD on the nodes (do we have that already with SBD? Have to check ...) or as an action the admin triggers at a point in time when the whole cluster is available, where the high-level tool can go over the nodes and collect the data - and maybe complain/fail if one of the nodes isn't available. For smooth operation the latter might be done prior to enabling watchdog fencing. But even later we could trim it down from a high value to a smaller but still safe value. The decision whether to take a node into account or not might be an issue though. I don't know whether high-level tooling is doing anything similar already. crm_simulate?

@gao-yan
Member

gao-yan commented Oct 8, 2024

Good points. I'm still wary of autocalculation though. Too much opportunity for timing issues if nodes are joining and leaving. Plus there's the added complication of the fencer needing the value but the controller having the only access to remote nodes. @wenningerk , what do you think?

The setting of the hardware watchdog timeout on the nodes isn't anything that changes dynamically. And if it does change, it is probably unintentional, and we have our safety belt for that. Of course we could collect the timeouts from the devices that show up with the quorate partition and move the effective stonith-watchdog-timeout up if it no longer matches. But as Ken says, we would have to pass the info around between daemons. Races where the info hasn't spread yet while a resource started by pacemaker is already up and there is already a reason to fence are maybe not that likely. With resources we find already running during our initial probes, things might become more critical.

An idea could be for the nodes to report, store and exchange their watchdog timeouts as an internal node attribute through attrd... To be on the safe side, whether a specific node can be fenced via watchdog fencing depends on whether its watchdog timeout is known to us in this case, otherwise for instance node_does_watchdog_fencing() returns false? Eventually it could also technically mean that each node might have its own practical stonith-watchdog-time...

@kgaillot
Contributor

kgaillot commented Oct 8, 2024

An idea could be for the nodes to report, store and exchange their watchdog timeouts as an internal node attribute through attrd... To be on the safe side, whether a specific node can be fenced via watchdog fencing depends on whether its watchdog timeout is known to us in this case, otherwise for instance node_does_watchdog_fencing() returns false? Eventually it could also technically mean that each node might have its own practical stonith-watchdog-time...

A node-specific watchdog timeout sounds interesting. If we know the value for each node, there's no reason to wait the maximum. I wouldn't want to require the value to be known in order to enable watchdog fencing; the cluster option could be the default (and so ideally set to the maximum by the user).

Currently the fencer doesn't have an attribute manager connection, though it does search the CIB for node attributes in some cases. Also, a transient attribute would get cleared when the node leaves the cluster, which defeats the purpose, but we don't normally set permanent attributes dynamically. It might make more sense as an XML attribute of node_status.

Something like:

  • The controller updates node_status for the local node at start-up after connecting to the CIB.
  • The controller updates node_status for remote nodes after connecting to them.
  • The fencer reads the node_status values from the CIB.
  • The values are never deleted, so we always have the last known value.
  • If a node uses watchdog fencing but has no value in node_status, the cluster option is used. The cluster option acts as a default and also enables watchdog fencing in general, but local values will always dynamically override it once known.
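
To make the read-with-fallback behavior in the list above concrete, here is a rough sketch using libxml2 directly; the watchdog-timeout-ms attribute name and the numbers are made up for illustration, and this is not Pacemaker's actual schema or code:

```c
/* Sketch of the fallback logic only; element/attribute names are illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <libxml/tree.h>

/* Watchdog timeout (ms) to use when fencing the node described by the given
 * status entry: its own recorded value if present, otherwise the cluster-wide
 * stonith-watchdog-timeout acting as a default. */
static long
watchdog_timeout_for(xmlNode *node_entry, long cluster_option_ms)
{
    xmlChar *value = xmlGetProp(node_entry, (const xmlChar *) "watchdog-timeout-ms");
    long result = cluster_option_ms;

    if (value != NULL) {
        result = strtol((const char *) value, NULL, 10);
        xmlFree(value);
    }
    return result;
}

int
main(void)
{
    xmlNode *entry = xmlNewNode(NULL, (const xmlChar *) "node_state");

    /* No recorded value yet: fall back to the cluster option (10s here) */
    long before = watchdog_timeout_for(entry, 10000);

    /* Once the controller has recorded a value, it overrides the default */
    xmlSetProp(entry, (const xmlChar *) "watchdog-timeout-ms", (const xmlChar *) "20000");
    long after = watchdog_timeout_for(entry, 10000);

    printf("%ld %ld\n", before, after); /* 10000 20000 */
    xmlFreeNode(entry);
    return 0;
}
```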

@gao-yan
Member

gao-yan commented Oct 8, 2024

An idea could be for the nodes to report, store and exchange their watchdog timeouts as an internal node attribute through attrd... To be on the safe side, whether a specific node can be fenced via watchdog fencing depends on whether its watchdog timeout is known to us in this case, otherwise for instance node_does_watchdog_fencing() returns false? Eventually it could also technically mean that each node might have its own practical stonith-watchdog-time...

A node-specific watchdog timeout sounds interesting. If we know the value for each node, there's no reason to wait the maximum. I wouldn't want to require the value to be known in order to enable watchdog fencing;

I meant for the case of stonith-watchdog-timeout with a negative value, requiring a node's watchdog timeout to be known in order to enable watchdog fencing for the specific node. I was thinking not to break such an existing setting (hence the compatibility). So that the purpose could still be served, meaning users wouldn't have to know the watchdog timeouts or how to choose a proper value for stonith-watchdog-timeout.

the cluster option could be the default (and so ideally set to the maximum by the user).

Currently the fencer doesn't have an attribute manager connection, though it does search the CIB for node attributes in some cases. Also, a transient attribute would get cleared when the node leaves the cluster, which defeats the purpose, but we don't normally set permanent attributes dynamically. It might make more sense as an XML attribute of node_status.

To be precise, "node_state", right?

Something like:

  • The controller updates node_status for the local node at start-up after connecting to the CIB.
  • The controller updates node_status for remote nodes after connecting to them.
  • The fencer reads the node_status values from the CIB.
  • The values are never deleted, so we always have the last known value.
  • If a node uses watchdog fencing but has no value in node_status, the cluster option is used. The cluster option acts as a default and also enables watchdog fencing in general, but local values will always dynamically override it once known.

Good ideas.

@kgaillot
Contributor

kgaillot commented Oct 8, 2024

A node-specific watchdog timeout sounds interesting. If we know the value for each node, there's no reason to wait the maximum. I wouldn't want to require the value to be known in order to enable watchdog fencing;

I meant for the case of stonith-watchdog-timeout with a negative value, requiring a node's watchdog timeout to be known in order to enable watchdog fencing for the specific node. I was thinking not to break such an existing setting (hence the compatibility). So that the purpose could still be served, meaning users wouldn't have to know the watchdog timeouts or how to choose a proper value for stonith-watchdog-timeout.

The purpose of 3.0.0 is to collect worthwhile compatibility breaks. I'm definitely trying to minimize the impact on real-world configurations by focusing mainly on behavior that has long been deprecated and undocumented. We included stonith-watchdog-timeout mainly because the current dynamic calculation is potentially unsafe, but also because it's the only interval option that accepts negative values, and little inconsistencies like that always cause annoyances at some point (like keeping us from using the standard validation functions, confusing users, etc.).

If we track per-node values, then safety becomes largely a non-issue. The calculation becomes always dynamic, and stonith-watchdog-timeout becomes a default used only before a local value is first known for a newly added node.

With safety no longer a significant issue, should we still drop negative values? It would still be nice to have consistency across interval options, but I'm not sure that's worth breaking configurations for. Of course, we have to actually implement per-node values, but that can be done anytime (not just a major release).

What if we used 1 instead of negative to mean auto-calculate the default (which again would eventually be used only when a node is first added)? It would be easy to transform negative to 1 to avoid breaking configurations, and 1s would never be a realistic value for a watchdog timeout (hopefully? or use 1ms to be absolutely sure?).

the cluster option could be the default (and so ideally set to the maximum by the user).
Currently the fencer doesn't have an attribute manager connection, though it does search the CIB for node attributes in some cases. Also, a transient attribute would get cleared when the node leaves the cluster, which defeats the purpose, but we don't normally set permanent attributes dynamically. It might make more sense as an XML attribute of node_status.

To be precise, "node_state", right?

oops, yes

Something like:

  • The controller updates node_status for the local node at start-up after connecting to the CIB.
  • The controller updates node_status for remote nodes after connecting to them.
  • The fencer reads the node_status values from the CIB.
  • The values are never deleted, so we always have the last known value.
  • If a node uses watchdog fencing but has no value in node_status, the cluster option is used. The cluster option acts as a default and also enables watchdog fencing in general, but local values will always dynamically override it once known.

Good ideas.

@gao-yan
Member

gao-yan commented Oct 9, 2024

I meant for the case of stonith-watchdog-timeout with a negative value, requiring a node's watchdog timeout to be known in order to enable watchdog fencing for the specific node. I was thinking not to break such an existing setting (hence the compatibility). So that the purpose could still be served, meaning users wouldn't have to know the watchdog timeouts or how to choose a proper value for stonith-watchdog-timeout.

The purpose of 3.0.0 is to collect worthwhile compatibility breaks. I'm definitely trying to minimize the impact on real-world configurations by focusing mainly on behavior that has long been deprecated and undocumented. We included stonith-watchdog-timeout mainly because the current dynamic calculation is potentially unsafe, but also because it's the only interval option that accepts negative values, and little inconsistencies like that always cause annoyances at some point (like keeping us from using the standard validation functions, confusing users, etc.).

Good point.

If we track per-node values, then safety becomes largely a non-issue. The calculation becomes always dynamic, and stonith-watchdog-timeout becomes a default used only before a local value is first known for a newly added node.

Indeed.

With safety no longer a significant issue, should we still drop negative values? It would still be nice to have consistency across interval options, but I'm not sure that's worth breaking configurations for. Of course, we have to actually implement per-node values, but that can be done anytime (not just a major release).

What if we used 1 instead of negative to mean auto-calculate the default (which again would eventually be used only when a node is first added)? It would be easy to transform negative to 1 to avoid breaking configurations, and 1s would never be a realistic value for a watchdog timeout (hopefully? or use 1ms to be absolutely sure?).

Probably we could generalize the handling for misconfigurations. We could do a calculation anyway for the fencing target upon fencing query, and tolerate any too small value by applying the calculated one. In that case, it wouldn't require a special value for auto calculation.

Then for upgrade of CIB syntax, we'd probably rather transform negative to a sensible value depending on whether SBD_WATCHDOG_TIMEOUT env is known in the context, otherwise a generic default such as 10s?

@kgaillot
Contributor

kgaillot commented Oct 9, 2024

Probably we could generalize the handling for misconfigurations. We could do a calculation anyway for the fencing target upon fencing query, and tolerate any too small value by applying the calculated one. In that case, it wouldn't require a special value for auto calculation.

Right, we could use the maximum of stonith-watchdog-timeout and twice the target's SBD_WATCHDOG_TIMEOUT. That way someone can force a longer wait if they want, or they can configure any low value to always use the calculated timeout.
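
As a tiny illustration of that rule (hypothetical names, not actual Pacemaker code): the configured value only matters when it is larger than the calculated one, so a too-small or zero value effectively means "use the calculation".

```c
/* Illustration only: effective per-target wait under the proposed rule. */
long
effective_wait_ms(long configured_swt_ms, long target_sbd_watchdog_ms)
{
    long calculated_ms = 2 * target_sbd_watchdog_ms;

    /* A deliberately higher configured value is honored; anything lower
     * (including 0 or a too-small misconfiguration) falls back to the
     * calculated value. */
    return (configured_swt_ms > calculated_ms)? configured_swt_ms : calculated_ms;
}
```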

Then for upgrade of CIB syntax, we'd probably rather transform negative to a sensible value depending on whether SBD_WATCHDOG_TIMEOUT env is known in the context, otherwise a generic default such as 10s?

Unfortunately the XSL transform knows only the CIB XML, and there's no way to provide any other info. Also transforms can be done without even having a live cluster or being on a cluster node (saved CIBs).

With per-node calculation implemented, any generic default would be fine. Unfortunately due to time constraints, we're looking at implementing that sometime in the future, keeping the current method for 3.0.0. That's why we'd still need a special value to mean use the current method in the 3.0.0 time frame. But 1s would work fine for that -- even after we implement per-node values, 1s could still mean use twice the local value as the global default, and it wouldn't be risky because it would only be usable before the first time a node joins.

@kgaillot
Contributor

kgaillot commented Oct 9, 2024

But 1s would work fine for that -- even after we implement per-node values, 1s could still mean use twice the local value as the global default, and it wouldn't be risky because it would only be usable before the first time a node joins.

There is the question of rolling upgrades. Overall they would be fine since the schema version and feature set will be bumped for 3.0.0 -- the new schema (enforcing a nonnegative value, and transforming negative values to the new magic value) wouldn't be used until all nodes were upgraded. However someone could set a stonith-watchdog-timeout of 1s (or whatever the magic value is) during the rolling upgrade, and that would be interpreted differently by nodes of different versions. I think that could just be a "don't do that" type of situation.

@gao-yan
Member

gao-yan commented Oct 18, 2024

Should we perhaps already deprecate negative values, for instance in v2.1.9?

@wenningerk
Contributor Author

wenningerk commented Oct 21, 2024

If I got the above right, the approach we are tending towards would be: if we read stonith-watchdog-timeout to be smaller than 2 * SBD_WATCHDOG_TIMEOUT, we would not refuse to start or stop pacemaker, as we do today (where the check is stonith-watchdog-timeout > SBD_WATCHDOG_TIMEOUT), but would instead adjust a per-node timeout to 2 * SBD_WATCHDOG_TIMEOUT.
This would also be helpful in those cases where we've gotten it wrong once and there is no easy way to correct it, as pacemaker won't stay up long enough to set a new stonith-watchdog-timeout.
With 2 * SBD_WATCHDOG_TIMEOUT being the safe configuration, we should still leave people - who have thought about it carefully - the opportunity to go down to SBD_WATCHDOG_TIMEOUT. This would require an additional configuration possibility.
If we want the pattern of the cluster property stonith-watchdog-timeout being something like a safe fallback, some automated decrease behavior might be interesting as well.
This would leave us with two configuration flags - something like automatic-increase & automatic-decrease.
Since we are touching the whole thing and intend to introduce per-node values - at least for internal use - would we like to be able to configure something on a per-node basis as well? For example, when a node attribute of stonith-watchdog-timeout or stonith-watchdog-automatic-decrease/increase is present, it would overrule the cluster property ...

@kgaillot
Contributor

I'm leaning toward doing nothing for 2.1.9/3.0.0 due to time constraints and an unclear end goal.

For the future, maybe something like:

  • Track node-specific STONITH_WATCHDOG_TIMEOUT as a node_state attribute
  • Rename stonith-watchdog-timeout to something like watchdog-fencing-duration, and use standard duration type (nonnegative interval specs)
  • Add new option(s) for auto-calculating the value, something like watchdog-fencing-auto=true/false/increase/decrease (see the sketch below), where true = use twice the target-specific value (or twice the local value if a target-specific value is not yet known), false = use stonith-watchdog-timeout exactly, increase = use the higher of stonith-watchdog-timeout or twice the target-specific value, and decrease = use the lower of stonith-watchdog-timeout or twice the target-specific value
  • When we eventually drop stonith-watchdog-timeout, an XSL transform would be straightforward (-1 -> watchdog-fencing-duration=0 and watchdog-fencing-auto=true)
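
A hedged sketch of how such a watchdog-fencing-auto option might resolve to an effective per-target duration (option, enum, and helper names are hypothetical, following the bullet list above; this is not existing Pacemaker code):

```c
/* Hypothetical resolution of the proposed watchdog-fencing-auto option.
 * All names here are illustrative only. */

enum watchdog_fencing_auto {
    WFA_TRUE,      /* always use twice the target-specific (or local) value */
    WFA_FALSE,     /* use the configured duration exactly */
    WFA_INCREASE,  /* use the higher of the two */
    WFA_DECREASE,  /* use the lower of the two */
};

long
watchdog_fencing_wait_ms(enum watchdog_fencing_auto mode,
                         long configured_ms, long target_watchdog_ms)
{
    long calculated_ms = 2 * target_watchdog_ms;

    switch (mode) {
        case WFA_TRUE:
            return calculated_ms;
        case WFA_FALSE:
            return configured_ms;
        case WFA_INCREASE:
            return (configured_ms > calculated_ms)? configured_ms : calculated_ms;
        case WFA_DECREASE:
            return (configured_ms < calculated_ms)? configured_ms : calculated_ms;
    }
    return configured_ms; /* unreachable; keeps compilers happy */
}
```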
