Health check kept showing passed even after a consensus failure #1039

Open
Staking7pc opened this issue Dec 3, 2021 · 0 comments

Description/Reasoning

Our node ran into a 'too many open files' error. This caused a consensus failure, and the health check did not fail while the node was in that state.

Current Behaviour

We relied on the health check and axelard status for our alerting. Even though consensus had failed, we were not alerted.
We were jailed during this period.

Expected Behaviour

When there is a consensus failure, the axelard health check should not report passed.

Steps to reproduce (for bugs)

Relevant Logs or Files

2021-12-01T13:30:33Z   INF  received complete proposal block  hash= 883699B4974EA2EFD8674FDBB58D5613197072A32BF364731321F51F66BED25D  height= 189903  module= consensus
 2021-12-01T13:30:33Z    ERR   CONSENSUS FAILURE!!!  err= "open /root/.axelar_testnet/.core/data/write-file-atomic-07343618385475708275: too many open files"  module= consensus  stack= "goroutine 893 [running]:\nruntime/debug.Stack(0xc02687b050, 0x2015080, 0xc010132000)\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0x9f\ngithub.com/tendermint/tendermint/consensus.(*State).receiveRoutine.func2(0xc000d72a80, 0x2521850)\n\t/go/pkg/mod/github.com/tendermint/[email protected]/consensus/state.go:726 +0x5b\npanic(0x2015080, 0xc010132000)\n\t/usr/local/go/src/runtime/panic.go:965 +0x1b9\ngithub.com/tendermint/tendermint/privval
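
As a stopgap for alerting, the node log could be scanned for this error directly. This is a minimal sketch, assuming the log is available as an iterable of text lines. The marker string is taken from the log above; the function name is hypothetical and not part of axelard.

```python
import re

# Marker text copied from the Tendermint error line in the log above.
CONSENSUS_FAILURE = re.compile(r"CONSENSUS FAILURE")


def find_consensus_failures(lines):
    """Return the log lines that report a consensus failure."""
    return [line for line in lines if CONSENSUS_FAILURE.search(line)]
```

An alerting script could call this on newly appended log lines and raise an alert whenever the returned list is non-empty.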

The maximum number of open files on the node was 1024. I should have increased it, as I believe around 4k is recommended for Tendermint (TM).
I am not sure whether hitting this limit is supposed to trigger a consensus failure.
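
The open-files limit can be checked ahead of time. The sketch below uses Python's standard resource module (Unix only). It reads the limit of the Python process itself, so it only reflects the node's limit if it is run in the same environment as axelard. The 4096 minimum is an assumption based on the 4k figure mentioned above, and the function names are hypothetical.

```python
import resource


def nofile_soft_limit():
    """Return the soft limit on open file descriptors for this process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


def limit_is_sufficient(soft_limit, minimum=4096):
    """Check a soft limit against an assumed minimum (4096 is not an official value)."""
    # RLIM_INFINITY means the limit is effectively unlimited.
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= minimum
```

With the limit of 1024 reported above, limit_is_sufficient(1024) returns False.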

The axelar core process did not crash, and its status continues to show "catching_up": false:

{
  "latest_block_hash": "EE08DFCBEB43230D8A117817ABC9459AFE292FB93192F3366284AEB3CBB65E24",
  "latest_app_hash": "4D791A881E1A92D7DD97A290B8BF0F667291DDFC56A957732970700DAB24AF1F",
  "latest_block_height": "189902",
  "latest_block_time": "2021-12-01T13:30:23.121017242Z",
  "earliest_block_hash": "0BD5E5A3C7E900091E577CCF84FAF87D08BF4E71D1CB9B95B24C0F096C652290",
  "earliest_app_hash": "E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855",
  "earliest_block_height": "1",
  "earliest_block_time": "2021-11-19T00:58:50.037626565Z",
  "catching_up": false
}
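
One way to catch this kind of stall from the status output is to check the age of latest_block_time instead of relying on catching_up. This is a minimal sketch, assuming the JSON above has been parsed into a dict. The 60-second threshold and the function names are hypothetical and would need tuning to the chain's block time.

```python
from datetime import datetime, timedelta, timezone


def parse_block_time(value):
    """Parse a timestamp like 2021-12-01T13:30:23.121017242Z to second precision."""
    # Drop the trailing "Z" and the nanosecond fraction; seconds are enough here.
    seconds_part = value.rstrip("Z").split(".")[0]
    parsed = datetime.strptime(seconds_part, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def is_stalled(sync_info, now, max_age=timedelta(seconds=60)):
    """Return True if the latest block is older than max_age (assumed threshold)."""
    latest = parse_block_time(sync_info["latest_block_time"])
    return now - latest > max_age
```

For the status above, is_stalled would start returning True roughly a minute after 13:30:23 UTC, even though catching_up stays false.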

The health check also continues to report passed:

axelard health-check --tofnd-host localhost  tofnd --operator-addr axelarvaloper1wfwdw00wz232stjeagsjeztq2nsrmx35yzuht8 --node http://localhost:26657/
tofnd check: passed
broadcaster check: passed
operator check: passed