[WIP] Feature: watchdog-fencing: drop auto-calculation of stonith-watchdog-… #3677
Conversation
This is WIP. The code is looking good at a first glance, but I haven't done any tests with it, so maybe don't invest your time yet. I would like to complete the documentation stuff. For now I just removed the reference to the auto-calculation. Do you think we need to add that Pacemaker will of course start on nodes where sbd isn't active if that is configured via fence_watchdog? Or is it clear enough from what we have?
looking good
I think anyone using fence_watchdog will be following some documentation for it. That would be good to add to the man page though.
Would it make sense to exchange the watchdog timeouts between the nodes and still do the automatic calculation but based on the maximum?
If there's a partition (especially, for example, at start-up), it's the value on the nodes you can't reach that would be relevant.
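A rough sketch of how such a maximum-based calculation could work, assuming the per-node hardware watchdog timeouts have already been exchanged somehow (the helper below is purely hypothetical, not Pacemaker's actual code):

```c
/* Hypothetical sketch only: derive a cluster-wide default for
 * stonith-watchdog-timeout from the largest watchdog timeout reported
 * by any known node, rather than from the local SBD_WATCHDOG_TIMEOUT. */
long
watchdog_fencing_default_ms(const long *node_watchdog_ms, int n_nodes)
{
    long max_ms = 0;

    for (int i = 0; i < n_nodes; i++) {
        if (node_watchdog_ms[i] > max_ms) {
            max_ms = node_watchdog_ms[i];
        }
    }

    /* Same factor of two that the current auto-calculation applies to
     * the local value, but applied to the cluster-wide maximum. */
    return 2 * max_ms;
}
```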
If their watchdog timeouts are unknown, they must have been partitioned since the beginning. In that case, they won't self-reset anyway, but OTOH they won't host resources if they are not quorate.
Force-pushed from d9fd8a5 to 5bb16c5.
Being a bit more verbose now in the fence_watchdog man-page, mentioning that it can be used to limit watchdog-fencing to certain nodes and to skip the daemon & watchdog consistency check on others.
Good points. I'm still wary of auto-calculation though. There's too much opportunity for timing issues if nodes are joining and leaving. Plus there's the added complication of the fencer needing the value while only the controller has access to remote nodes. @wenningerk, what do you think?
The setting of the hardware watchdog timeout on the nodes isn't something that changes dynamically. And if it does change, that's probably unintentional, and we have our safety belt for that.
An idea could be for the nodes to report, store and exchange their watchdog timeouts as an internal node attribute through attrd... To be on the safe side, whether a specific node can be fenced via watchdog fencing would then depend on whether its watchdog timeout is known to us; otherwise, for instance, node_does_watchdog_fencing() returns false? Eventually it could also technically mean that each node might have its own practical stonith-watchdog-timeout.
A node-specific watchdog timeout sounds interesting. If we know the value for each node, there's no reason to wait the maximum. I wouldn't want to require the value to be known in order to enable watchdog fencing; the cluster option could be the default (and so ideally set to the maximum by the user).

Currently the fencer doesn't have an attribute manager connection, though it does search the CIB for node attributes in some cases. Also, a transient attribute would get cleared when the node leaves the cluster, which defeats the purpose, but we don't normally set permanent attributes dynamically. It might make more sense as an XML attribute of the node's entry in the CIB status section. Something like:
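(A hypothetical sketch only; the watchdog-timeout attribute below is illustrative and not an existing Pacemaker attribute.)

```xml
<!-- Hypothetical: a per-node value recorded when the node first joins,
     carried as an XML attribute of its node_state entry -->
<status>
  <node_state id="1" uname="node1" in_ccm="true" crmd="online"
              watchdog-timeout="10s">
    <!-- transient_attributes, lrm history, etc. -->
  </node_state>
</status>
```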
I meant the case of stonith-watchdog-timeout with a negative value: requiring a node's watchdog timeout to be known in order to enable watchdog fencing for that specific node. I was thinking not to break such an existing setting (hence the compatibility), so that the purpose could still be served, meaning users wouldn't have to know the watchdog timeouts or how to choose a proper value for stonith-watchdog-timeout.
To be precise, "node_state", right?
Good ideas.
The purpose of 3.0.0 is to collect worthwhile compatibility breaks. I'm definitely trying to minimize the impact on real-world configurations by focusing mainly on behavior that has long been deprecated and undocumented. We included this one.

If we track per-node values, then safety becomes largely a non-issue. The calculation becomes always dynamic, and the configured cluster option is just a default.

With safety no longer a significant issue, should we still drop negative values? It would still be nice to have consistency across interval options, but I'm not sure that's worth breaking configurations for. Of course, we have to actually implement per-node values, but that can be done anytime (not just in a major release).

What if we used 1 instead of negative to mean auto-calculate the default (which again would eventually be used only when a node is first added)? It would be easy to transform negative to 1 to avoid breaking configurations, and 1s would never be a realistic value for a watchdog timeout (hopefully? or use 1ms to be absolutely sure?).
oops, yes
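As a sketch of the magic-value idea above, the cluster option would then look something like this in the CIB (assuming 1s is chosen as the auto-calculate marker; this is illustrative only, not implemented behavior):

```xml
<!-- Hypothetical: a deliberately unrealistic value such as 1s would mean
     "auto-calculate the default", replacing today's negative values -->
<crm_config>
  <cluster_property_set id="cib-bootstrap-options">
    <nvpair id="cib-bootstrap-options-stonith-watchdog-timeout"
            name="stonith-watchdog-timeout" value="1s"/>
  </cluster_property_set>
</crm_config>
```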
…timeout

The implementation wasn't safe, in that it didn't prevent a mismatch of stonith-watchdog-timeout and SBD_WATCHDOG_TIMEOUT on certain nodes when SBD_WATCHDOG_TIMEOUT wasn't configured to the same value on all nodes.
Force-pushed from 5bb16c5 to 4738dec.
Good point.
Indeed.
Probably we could generalize the handling of misconfigurations. We could do a calculation anyway for the fencing target upon a fencing query, and tolerate any too-small value by applying the calculated one. In that case, it wouldn't require a special value for auto-calculation. Then for the upgrade of CIB syntax, we'd probably rather transform negative to a sensible value depending on whether the SBD_WATCHDOG_TIMEOUT env is known in the context, otherwise a generic default such as 10s?
Right, we could use the maximum of stonith-watchdog-timeout and twice the target's SBD_WATCHDOG_TIMEOUT. That way someone can force a longer wait if they want, or they can configure any low value to always use the calculated timeout.
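A minimal sketch of that rule with hypothetical names (not the actual fencer code):

```c
/* Hypothetical sketch: the per-target timeout is the configured
 * stonith-watchdog-timeout, but never less than twice the target's
 * SBD_WATCHDOG_TIMEOUT, so a too-small configured value is tolerated
 * by falling back to the calculated one. */
long
watchdog_wait_ms(long configured_ms, long target_sbd_watchdog_ms)
{
    long calculated_ms = 2 * target_sbd_watchdog_ms;

    return (configured_ms > calculated_ms)? configured_ms : calculated_ms;
}
```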
Unfortunately the XSL transform knows only the CIB XML, and there's no way to provide any other info. Also, transforms can be done without even having a live cluster or being on a cluster node (saved CIBs).

With per-node calculation implemented, any generic default would be fine. Unfortunately, due to time constraints, we're looking at implementing that sometime in the future, keeping the current method for 3.0.0. That's why we'd still need a special value to mean "use the current method" in the 3.0.0 time frame. But 1s would work fine for that -- even after we implement per-node values, 1s could still mean "use twice the local value as the global default", and it wouldn't be risky because it would only be usable before the first time a node joins.
There is the question of rolling upgrades. Overall they would be fine, since the schema version and feature set will be bumped for 3.0.0 -- the new schema (enforcing a nonnegative value, and transforming negative values to the new magic value) wouldn't be used until all nodes were upgraded. However, someone could set a stonith-watchdog-timeout of 1s (or whatever the magic value is) during the rolling upgrade, and that would be interpreted differently by nodes of different versions. I think that could just be a "don't do that" type of situation.
Should we perhaps already deprecate negative values, for instance in v2.1.9?
If I got the above right, the approach we are tending towards would be: if we read stonith-watchdog-timeout to be smaller than 2 * SBD_WATCHDOG_TIMEOUT, we would not refuse to start, or stop Pacemaker, as we are doing today (where we require stonith-watchdog-timeout > SBD_WATCHDOG_TIMEOUT), but instead adjust a per-node timeout to 2 * SBD_WATCHDOG_TIMEOUT.
I'm leaning to doing nothing for 2.1.9/3.0.0 due to time constraints and an unclear end goal. For the future, maybe something like:
…timeout
The implementation wasn't safe, in that it didn't prevent a mismatch of stonith-watchdog-timeout and SBD_WATCHDOG_TIMEOUT on certain nodes when SBD_WATCHDOG_TIMEOUT wasn't configured to the same value on all nodes.

And now it is possible to verify the value as a valid interval spec.