You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Whilst it is critical that validators maintain their rpc endpoints to ensure they are able to participate in signing for all the chains they are signalling on, it is possible that an endpoint could fail due to events outside their control. For example, a chain halt may cause rpc nodes to non-respond or respond incorrectly. When the network grows to the point that validators are maintaining dozens of chains it would be unfortunate for a validator to be unable to restart vald without causing disruption to all chains they are supporting, due to a failure on a single rpc endpoint.
Current Behaviour
Vald panics and crashes at startup if one of its configured endpoints does not connect and return an expected response
Expected Behaviour
Vald should be able to handle an issue with one or more endpoints gracefully and produce clear log output highlighting which rpc endpoints have problems (this should be repeated regularly until the error is resolved). This would enable a validator to still participate in keygens on its other chains. If the fault was due to an external event such as a chain halt it would also mean that as soon as rpc functionality was restored the validator would be able to participate immediately. This means that a validator doesn’t have to go through the trouble of altering configs on the fly whenever there is an issue with an external chain, and would not suffer any delay in resuming service once the chain is active again.
It would be useful to augment this with a prometheus metric that outputs the status of all the configured endpoints, eg: axelar_vald_external_chain_status{chain=”ethereum”} 1
where there is a label for each chain configured and 0 is returned if there is an issue, 1 if the connection is healthy.
Steps to reproduce (for bugs)
Relevant Logs or Files
The text was updated successfully, but these errors were encountered:
Description/Reasoning
Whilst it is critical that validators maintain their rpc endpoints to ensure they are able to participate in signing for all the chains they are signalling on, it is possible that an endpoint could fail due to events outside their control. For example, a chain halt may cause rpc nodes to non-respond or respond incorrectly. When the network grows to the point that validators are maintaining dozens of chains it would be unfortunate for a validator to be unable to restart vald without causing disruption to all chains they are supporting, due to a failure on a single rpc endpoint.
Current Behaviour
Vald panics and crashes at startup if one of its configured endpoints does not connect and return an expected response
Expected Behaviour
Vald should be able to handle an issue with one or more endpoints gracefully and produce clear log output highlighting which rpc endpoints have problems (this should be repeated regularly until the error is resolved). This would enable a validator to still participate in keygens on its other chains. If the fault was due to an external event such as a chain halt it would also mean that as soon as rpc functionality was restored the validator would be able to participate immediately. This means that a validator doesn’t have to go through the trouble of altering configs on the fly whenever there is an issue with an external chain, and would not suffer any delay in resuming service once the chain is active again.
It would be useful to augment this with a prometheus metric that outputs the status of all the configured endpoints, eg:
axelar_vald_external_chain_status{chain=”ethereum”} 1
where there is a label for each chain configured and 0 is returned if there is an issue, 1 if the connection is healthy.
Steps to reproduce (for bugs)
Relevant Logs or Files
The text was updated successfully, but these errors were encountered: