Vald panics and crashes at startup if one of its configured endpoints does not connect and return an expected response #1015

Pete-LunaNova · 2021-11-26T18:07:43Z

Description/Reasoning

Whilst it is critical that validators maintain their RPC endpoints so that they can participate in signing for all the chains they are signalling on, an endpoint can fail because of events outside their control. For example, a chain halt may cause RPC nodes to stop responding or to respond incorrectly. Once the network grows to the point where validators are maintaining dozens of chains, it would be unfortunate if a failure on a single RPC endpoint prevented a validator from restarting vald without disrupting every chain they support.

Current Behaviour

Vald panics and crashes at startup if any one of its configured endpoints fails to connect or does not return an expected response.

Expected Behaviour

Vald should handle a problem with one or more endpoints gracefully and produce clear log output identifying which RPC endpoints are affected (the message should be repeated regularly until the error is resolved). This would let a validator continue to participate in keygens on its other chains. If the fault was caused by an external event such as a chain halt, the validator would also be able to participate again as soon as RPC functionality was restored. The validator would not have to alter configs on the fly whenever there is an issue with an external chain, and would not suffer any delay in resuming service once the chain is active again.

It would be useful to augment this with a Prometheus metric that outputs the status of all the configured endpoints, e.g.:
axelar_vald_external_chain_status{chain="ethereum"} 1
where there is a label for each configured chain, and the value is 1 if the connection is healthy and 0 if there is an issue.

Steps to reproduce (for bugs)

Relevant Logs or Files
