
[WIP] Feature: watchdog-fencing: drop auto-calculation of stonith-watchdog-timeout #3677

Open · wants to merge 1 commit into base: main
Conversation

wenningerk
Contributor


The implementation wasn't safe: it didn't prevent a mismatch between stonith-watchdog-timeout and SBD_WATCHDOG_TIMEOUT on certain nodes when SBD_WATCHDOG_TIMEOUT wasn't configured to the same value on all nodes.
Now it is also possible to validate the value as a valid interval spec.
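
For context, a minimal sketch (not the actual Pacemaker implementation) of the dropped auto-calculation, assuming the documented semantics that a negative stonith-watchdog-timeout meant "use twice the local SBD_WATCHDOG_TIMEOUT". Because each node only looks at its own environment, nodes with different SBD_WATCHDOG_TIMEOUT values silently end up with different effective timeouts:

```c
/* Illustration only; not Pacemaker's actual code. */
#include <stdio.h>
#include <stdlib.h>

/* Effective timeout (seconds) a node would use for watchdog fencing. */
static long
effective_swt_s(long configured_swt_s)
{
    if (configured_swt_s < 0) {
        /* Auto-calculate from the *local* environment only */
        const char *env = getenv("SBD_WATCHDOG_TIMEOUT");
        long sbd_wd_s = (env != NULL)? strtol(env, NULL, 10) : 5; /* 5s fallback for this sketch */

        return 2 * sbd_wd_s;
    }
    return configured_swt_s;
}

int
main(void)
{
    /* With stonith-watchdog-timeout=-1, a node with SBD_WATCHDOG_TIMEOUT=5
     * waits 10s while a node with SBD_WATCHDOG_TIMEOUT=30 waits 60s, so the
     * survivor may consider the target dead before its watchdog has fired. */
    printf("effective stonith-watchdog-timeout: %lds\n", effective_swt_s(-1));
    return 0;
}
```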

@wenningerk
Contributor Author

This is WIP. The code is looking good at a first glance, but I haven't done any tests with it, so maybe don't invest your time yet. I would like to complete the documentation work. For now I just removed the reference to the auto-calculation. Do you think we need to add that pacemaker will of course start on nodes where sbd isn't active if that is configured via fence_watchdog? Or is it clear enough from what we have?
@gao-yan, can you please check which part of the Chinese text I would have to remove/modify? It would be cool if you could give me the line numbers - if that is enough.

Contributor

@kgaillot left a comment

looking good

Review comments (outdated, resolved) on:
  • doc/sphinx/Pacemaker_Explained/cluster-options.rst
  • lib/common/options.c
  • lib/common/watchdog.c
  • po/zh_CN.po
@kgaillot
Contributor

kgaillot commented Oct 2, 2024

Do you think we need to add that pacemaker will of course start on nodes where sbd isn't active if that is configured via fence_watchdog? Or is it clear enough from what we have?

I think anyone using fence_watchdog will be following some documentation for it. That would be good to add to the man page though.

Member

@gao-yan left a comment

Would it make sense to exchange the watchdog timeouts between the nodes and still do the automatic calculation but based on the maximum?

@kgaillot
Contributor

kgaillot commented Oct 3, 2024

Would it make sense to exchange the watchdog timeouts between the nodes and still do the automatic calculation but based on the maximum?

If there's a partition (especially for example at start-up), it's the value on the nodes you can't reach that would be relevant

@gao-yan
Member

gao-yan commented Oct 4, 2024

Would it make sense to exchange the watchdog timeouts between the nodes and still do the automatic calculation but based on the maximum?

If there's a partition (especially for example at start-up), it's the value on the nodes you can't reach that would be relevant

If their watchdog timeouts are unknown, they must have been partitioned since the beginning. In that case, they won't self-reset anyway but OTOH they won't host resources if they are not quorate.

@wenningerk force-pushed the validate_watchdog_timeout branch 2 times, most recently from d9fd8a5 to 5bb16c5 on October 7, 2024 at 18:33
@wenningerk
Contributor Author

Being a bit more verbose now in the fence_watchdog man page, mentioning that it can be used to limit watchdog fencing to certain nodes and to skip the daemon & watchdog consistency check on the others.

@kgaillot
Contributor

kgaillot commented Oct 7, 2024

Would it make sense to exchange the watchdog timeouts between the nodes and still do the automatic calculation but based on the maximum?

If there's a partition (especially for example at start-up), it's the value on the nodes you can't reach that would be relevant

If their watchdog timeouts are unknown, they must have been partitioned since the beginning. In that case, they won't self-reset anyway but OTOH they won't host resources if they are not quorate.

Good points. I'm still wary of autocalculation though. Too much opportunity for timing issues if nodes are joining and leaving. Plus there's the added complication of the fencer needing the value but the controller having the only access to remote nodes. @wenningerk , what do you think?

@wenningerk
Contributor Author

Good points. I'm still wary of autocalculation though. Too much opportunity for timing issues if nodes are joining and leaving. Plus there's the added complication of the fencer needing the value but the controller having the only access to remote nodes. @wenningerk , what do you think?

The setting of the hardware watchdog timeout on the nodes isn't anything that changes dynamically. And if it does change, it is probably unintentional, and we have our safety belt for that.
Of course we could collect the timeouts from the devices that show up with the quorate partition and move the effective stonith-watchdog-timeout up if it no longer matches. But as Ken says, we would have to pass the info around between daemons. Races where the info hasn't spread yet while a resource started by pacemaker is already up and there is already a reason to fence are maybe not that likely. With resources we find already running during our initial probes, things might become more critical.
Why don't we think of passing the issue to the high-level tooling? It could happen either with the setup of SBD on the nodes (do we have that already with SBD? Have to check ...) or as an action the admin triggers at a point in time when the whole cluster is available, where the high-level tool can go over the nodes and collect the data - and maybe complain/fail if one of the nodes isn't available. For smooth operation the latter might be done prior to enabling watchdog fencing. But even later we could trim it down from a high value to a smaller but still safe value. The decision whether to take a node into account or not might be an issue though. I don't know whether high-level tooling is doing anything similar already. crm_simulate?

@gao-yan
Member

gao-yan commented Oct 8, 2024

Good points. I'm still wary of autocalculation though. Too much opportunity for timing issues if nodes are joining and leaving. Plus there's the added complication of the fencer needing the value but the controller having the only access to remote nodes. @wenningerk , what do you think?

The setting of the hardware watchdog timeout on the nodes isn't anything that changes dynamically. And if it does change, it is probably unintentional, and we have our safety belt for that. Of course we could collect the timeouts from the devices that show up with the quorate partition and move the effective stonith-watchdog-timeout up if it no longer matches. But as Ken says, we would have to pass the info around between daemons. Races where the info hasn't spread yet while a resource started by pacemaker is already up and there is already a reason to fence are maybe not that likely. With resources we find already running during our initial probes, things might become more critical.

An idea could be for the nodes to report, store and exchange their watchdog timeouts as an internal node attribute through attrd... To be on the safe side, whether a specific node can be fenced via watchdog fencing depends on whether its watchdog timeout is known to us in this case, otherwise for instance node_does_watchdog_fencing() returns false? Eventually it could also technically mean that each node might have its own practical stonith-watchdog-time...

@kgaillot
Contributor

kgaillot commented Oct 8, 2024

An idea could be for the nodes to report, store and exchange their watchdog timeouts as an internal node attribute through attrd... To be on the safe side, whether a specific node can be fenced via watchdog fencing depends on whether its watchdog timeout is known to us in this case, otherwise for instance node_does_watchdog_fencing() returns false? Eventually it could also technically mean that each node might have its own practical stonith-watchdog-time...

A node-specific watchdog timeout sounds interesting. If we know the value for each node, there's no reason to wait the maximum. I wouldn't want to require the value to be known in order to enable watchdog fencing; the cluster option could be the default (and so ideally set to the maximum by the user).

Currently the fencer doesn't have an attribute manager connection, though it does search the CIB for node attributes in some cases. Also, a transient attribute would get cleared when the node leaves the cluster, which defeats the purpose, but we don't normally set permanent attributes dynamically. It might make more sense as an XML attribute of node_status.

Something like:

  • The controller updates node_status for the local node at start-up after connecting to the CIB.
  • The controller updates node_status for remote nodes after connecting to them.
  • The fencer reads the node_status values from the CIB.
  • The values are never deleted, so we always have the last known value.
  • If a node uses watchdog fencing but has no value in node_status, the cluster option is used. The cluster option acts as a default and also enables watchdog fencing in general, but local values will always dynamically override it once known.
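
To make the read-with-fallback behavior in the list above concrete, here is a rough sketch using libxml2 directly; the watchdog-timeout-ms attribute name and the numbers are made up for illustration, and this is not Pacemaker's actual schema or code:

```c
/* Sketch of the fallback logic only; element/attribute names are illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <libxml/tree.h>

/* Watchdog timeout (ms) to use when fencing the node described by the given
 * status entry: its own recorded value if present, otherwise the cluster-wide
 * stonith-watchdog-timeout acting as a default. */
static long
watchdog_timeout_for(xmlNode *node_entry, long cluster_option_ms)
{
    xmlChar *value = xmlGetProp(node_entry, (const xmlChar *) "watchdog-timeout-ms");
    long result = cluster_option_ms;

    if (value != NULL) {
        result = strtol((const char *) value, NULL, 10);
        xmlFree(value);
    }
    return result;
}

int
main(void)
{
    xmlNode *entry = xmlNewNode(NULL, (const xmlChar *) "node_state");

    /* No recorded value yet: fall back to the cluster option (10s here) */
    long before = watchdog_timeout_for(entry, 10000);

    /* Once the controller has recorded a value, it overrides the default */
    xmlSetProp(entry, (const xmlChar *) "watchdog-timeout-ms", (const xmlChar *) "20000");
    long after = watchdog_timeout_for(entry, 10000);

    printf("%ld %ld\n", before, after); /* 10000 20000 */
    xmlFreeNode(entry);
    return 0;
}
```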

@gao-yan
Member

gao-yan commented Oct 8, 2024

An idea could be for the nodes to report, store and exchange their watchdog timeouts as an internal node attribute through attrd... To be on the safe side, whether a specific node can be fenced via watchdog fencing depends on whether its watchdog timeout is known to us in this case, otherwise for instance node_does_watchdog_fencing() returns false? Eventually it could also technically mean that each node might have its own practical stonith-watchdog-time...

A node-specific watchdog timeout sounds interesting. If we know the value for each node, there's no reason to wait the maximum. I wouldn't want to require the value to be known in order to enable watchdog fencing;

I meant for the case of stonith-watchdog-timeout with a negative value, requiring a node's watchdog timeout to be known in order to enable watchdog fencing for the specific node. I was thinking not to break such an existing setting (hence the compatibility). So that the purpose could still be served, meaning users wouldn't have to know the watchdog timeouts or how to choose a proper value for stonith-watchdog-timeout.

the cluster option could be the default (and so ideally set to the maximum by the user).

Currently the fencer doesn't have an attribute manager connection, though it does search the CIB for node attributes in some cases. Also, a transient attribute would get cleared when the node leaves the cluster, which defeats the purpose, but we don't normally set permanent attributes dynamically. It might make more sense as an XML attribute of node_status.

To be precise, "node_state", right?

Something like:

  • The controller updates node_status for the local node at start-up after connecting to the CIB.
  • The controller updates node_status for remote nodes after connecting to them.
  • The fencer reads the node_status values from the CIB.
  • The values are never deleted, so we always have the last known value.
  • If a node uses watchdog fencing but has no value in node_status, the cluster option is used. The cluster option acts as a default and also enables watchdog fencing in general, but local values will always dynamically override it once known.

Good ideas.

@kgaillot
Contributor

kgaillot commented Oct 8, 2024

A node-specific watchdog timeout sounds interesting. If we know the value for each node, there's no reason to wait the maximum. I wouldn't want to require the value to be known in order to enable watchdog fencing;

I meant for the case of stonith-watchdog-timeout with a negative value, requiring a node's watchdog timeout to be known in order to enable watchdog fencing for the specific node. I was thinking not to break such an existing setting (hence the compatibility). So that the purpose could still be served, meaning users wouldn't have to know the watchdog timeouts or how to choose a proper value for stonith-watchdog-timeout.

The purpose of 3.0.0 is to collect worthwhile compatibility breaks. I'm definitely trying to minimize the impact on real-world configurations by focusing mainly on behavior that has long been deprecated and undocumented. We included stonith-watchdog-timeout mainly because the current dynamic calculation is potentially unsafe, but also because it's the only interval option that accepts negative values, and little inconsistencies like that always cause annoyances at some point (like keeping us from using the standard validation functions, confusing users, etc.).

If we track per-node values, then safety becomes largely a non-issue. The calculation becomes always dynamic, and stonith-watchdog-timeout becomes a default used only before a local value is first known for a newly added node.

With safety no longer a significant issue, should we still drop negative values? It would still be nice to have consistency across interval options, but I'm not sure that's worth breaking configurations for. Of course, we have to actually implement per-node values, but that can be done anytime (not just a major release).

What if we used 1 instead of negative to mean auto-calculate the default (which again would eventually be used only when a node is first added)? It would be easy to transform negative to 1 to avoid breaking configurations, and 1s would never be a realistic value for a watchdog timeout (hopefully? or use 1ms to be absolutely sure?).

the cluster option could be the default (and so ideally set to the maximum by the user).
Currently the fencer doesn't have an attribute manager connection, though it does search the CIB for node attributes in some cases. Also, a transient attribute would get cleared when the node leaves the cluster, which defeats the purpose, but we don't normally set permanent attributes dynamically. It might make more sense as an XML attribute of node_status.

To be precise, "node_state", right?

oops, yes

Something like:

  • The controller updates node_status for the local node at start-up after connecting to the CIB.
  • The controller updates node_status for remote nodes after connecting to them.
  • The fencer reads the node_status values from the CIB.
  • The values are never deleted, so we always have the last known value.
  • If a node uses watchdog fencing but has no value in node_status, the cluster option is used. The cluster option acts as a default and also enables watchdog fencing in general, but local values will always dynamically override it once known.

Good ideas.

@gao-yan
Member

gao-yan commented Oct 9, 2024

I meant for the case of stonith-watchdog-timeout with a negative value, requiring a node's watchdog timeout to be known in order to enable watchdog fencing for the specific node. I was thinking not to break such an existing setting (hence the compatibility). So that the purpose could still be served, meaning users wouldn't have to know the watchdog timeouts or how to choose a proper value for stonith-watchdog-timeout.

The purpose of 3.0.0 is to collect worthwhile compatibility breaks. I'm definitely trying to minimize the impact on real-world configurations by focusing mainly on behavior that has long been deprecated and undocumented. We included stonith-watchdog-timeout mainly because the current dynamic calculation is potentially unsafe, but also because it's the only interval option that accepts negative values, and little inconsistencies like that always cause annoyances at some point (like keeping us from using the standard validation functions, confusing users, etc.).

Good point.

If we track per-node values, then safety becomes largely a non-issue. The calculation becomes always dynamic, and stonith-watchdog-timeout becomes a default used only before a local value is first known for a newly added node.

Indeed.

With safety no longer a significant issue, should we still drop negative values? It would still be nice to have consistency across interval options, but I'm not sure that's worth breaking configurations for. Of course, we have to actually implement per-node values, but that can be done anytime (not just a major release).

What if we used 1 instead of negative to mean auto-calculate the default (which again would eventually be used only when a node is first added)? It would be easy to transform negative to 1 to avoid breaking configurations, and 1s would never be a realistic value for a watchdog timeout (hopefully? or use 1ms to be absolutely sure?).

Probably we could generalize the handling for misconfigurations. We could do a calculation anyway for the fencing target upon fencing query, and tolerate any too small value by applying the calculated one. In that case, it wouldn't require a special value for auto calculation.

Then for upgrade of CIB syntax, we'd probably rather transform negative to a sensible value depending on whether SBD_WATCHDOG_TIMEOUT env is known in the context, otherwise a generic default such as 10s?

@kgaillot
Contributor

kgaillot commented Oct 9, 2024

Probably we could generalize the handling for misconfigurations. We could do a calculation anyway for the fencing target upon fencing query, and tolerate any too small value by applying the calculated one. In that case, it wouldn't require a special value for auto calculation.

Right, we could use the maximum of stonith-watchdog-timeout and twice the target's SBD_WATCHDOG_TIMEOUT. That way someone can force a longer wait if they want, or they can configure any low value to always use the calculated timeout.
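
As a tiny illustration of that rule (hypothetical names, not actual Pacemaker code): the configured value only matters when it is larger than the calculated one, so a too-small or zero value effectively means "use the calculation".

```c
/* Illustration only: effective per-target wait under the proposed rule. */
long
effective_wait_ms(long configured_swt_ms, long target_sbd_watchdog_ms)
{
    long calculated_ms = 2 * target_sbd_watchdog_ms;

    /* A deliberately higher configured value is honored; anything lower
     * (including 0 or a too-small misconfiguration) falls back to the
     * calculated value. */
    return (configured_swt_ms > calculated_ms)? configured_swt_ms : calculated_ms;
}
```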

Then for upgrade of CIB syntax, we'd probably rather transform negative to a sensible value depending on whether SBD_WATCHDOG_TIMEOUT env is known in the context, otherwise a generic default such as 10s?

Unfortunately the XSL transform knows only the CIB XML, and there's no way to provide any other info. Also transforms can be done without even having a live cluster or being on a cluster node (saved CIBs).

With per-node calculation implemented, any generic default would be fine. Unfortunately due to time constraints, we're looking at implementing that sometime in the future, keeping the current method for 3.0.0. That's why we'd still need a special value to mean use the current method in the 3.0.0 time frame. But 1s would work fine for that -- even after we implement per-node values, 1s could still mean use twice the local value as the global default, and it wouldn't be risky because it would only be usable before the first time a node joins.

@kgaillot
Contributor

kgaillot commented Oct 9, 2024

But 1s would work fine for that -- even after we implement per-node values, 1s could still mean use twice the local value as the global default, and it wouldn't be risky because it would only be usable before the first time a node joins.

There is the question of rolling upgrades. Overall they would be fine since the schema version and feature set will be bumped for 3.0.0 -- the new schema (enforcing a nonnegative value, and transforming negative values to the new magic value) wouldn't be used until all nodes were upgraded. However someone could set a stonith-watchdog-timeout of 1s (or whatever the magic value is) during the rolling upgrade, and that would be interpreted differently by nodes of different versions. I think that could just be a "don't do that" type of situation.

@gao-yan
Member

gao-yan commented Oct 18, 2024

Should we perhaps already deprecate negative values, for instance in v2.1.9?

@wenningerk
Contributor Author

wenningerk commented Oct 21, 2024

If I got the above right, the approach we are tending towards would be: if we read stonith-watchdog-timeout to be smaller than 2 * SBD_WATCHDOG_TIMEOUT, we would not refuse to start or stop pacemaker, as we do today (where the check is stonith-watchdog-timeout > SBD_WATCHDOG_TIMEOUT), but would instead adjust a per-node timeout to 2 * SBD_WATCHDOG_TIMEOUT.
This would also be helpful in those cases where we've gotten it wrong once and there is no easy way to correct it, as pacemaker won't stay up long enough to set a new stonith-watchdog-timeout.
With 2 * SBD_WATCHDOG_TIMEOUT being the safe configuration, we should still leave people - who have thought about it carefully - the opportunity to go down to SBD_WATCHDOG_TIMEOUT. This would require an additional configuration possibility.
If we want the pattern of the cluster property stonith-watchdog-timeout being something like a safe fallback, some automated decrease behavior might be interesting as well.
This would leave us with two configuration flags - something like automatic-increase & automatic-decrease.
Since we are touching the whole thing and intend to introduce per-node values - at least for internal use - would we like to be able to configure something on a per-node basis as well? For example, when a node attribute of stonith-watchdog-timeout or stonith-watchdog-automatic-decrease/increase is present, it would overrule the cluster property ...

@kgaillot
Contributor

I'm leaning toward doing nothing for 2.1.9/3.0.0 due to time constraints and an unclear end goal.

For the future, maybe something like:

  • Track node-specific STONITH_WATCHDOG_TIMEOUT as a node_state attribute
  • Rename stonith-watchdog-timeout to something like watchdog-fencing-duration, and use standard duration type (nonnegative interval specs)
  • Add new option(s) for auto-calculating the value, something like watchdog-fencing-auto=true/false/increase/decrease (see the sketch below), where true = use twice the target-specific value (or twice the local value if a target-specific value is not yet known), false = use stonith-watchdog-timeout exactly, increase = use the higher of stonith-watchdog-timeout or twice the target-specific value, and decrease = use the lower of stonith-watchdog-timeout or twice the target-specific value
  • When we eventually drop stonith-watchdog-timeout, an XSL transform would be straightforward (-1 -> watchdog-fencing-duration=0 and watchdog-fencing-auto=true)
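
A hedged sketch of how such a watchdog-fencing-auto option might resolve to an effective per-target duration (option, enum, and helper names are hypothetical, following the bullet list above; this is not existing Pacemaker code):

```c
/* Hypothetical resolution of the proposed watchdog-fencing-auto option.
 * All names here are illustrative only. */

enum watchdog_fencing_auto {
    WFA_TRUE,      /* always use twice the target-specific (or local) value */
    WFA_FALSE,     /* use the configured duration exactly */
    WFA_INCREASE,  /* use the higher of the two */
    WFA_DECREASE,  /* use the lower of the two */
};

long
watchdog_fencing_wait_ms(enum watchdog_fencing_auto mode,
                         long configured_ms, long target_watchdog_ms)
{
    long calculated_ms = 2 * target_watchdog_ms;

    switch (mode) {
        case WFA_TRUE:
            return calculated_ms;
        case WFA_FALSE:
            return configured_ms;
        case WFA_INCREASE:
            return (configured_ms > calculated_ms)? configured_ms : calculated_ms;
        case WFA_DECREASE:
            return (configured_ms < calculated_ms)? configured_ms : calculated_ms;
    }
    return configured_ms; /* unreachable; keeps compilers happy */
}
```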
