Otherwise "healthy" nodes getting stuck on a block #183
Comments
issue_jx11_consensus.txt This one (jx-11) only has 25 peers, don't know why... still, 25 is 24 more than necessary to run. It just stopped moving forward at 17008 and got jailed a bit later.
I have the same with one of our nodes, stuck on 17068 with 48 peers.
@novy4 what's the ratio of the 48 inbound vs outbound?
@BenVanGithub Looks like you have no inbound peers on jx-11... Wondering if this is a different node than the original issue?
Could you possibly provide all of the proper debugging docs as you did with jx-11?
Yes. It is a different node. The original one was a customer node and I restarted it before I could grab the info that you requested. Next time I'll know what to grab before restarting it.
The inbound peer issue is new and may or may not be related. It seems to happen when I clone a new node from an old one. I'm doing some tests on my side before making an official issue... it might be a process error in how I clone.
@BenVanGithub thanks for this. It can be an array of things, but when a node falls behind it often seems to be the inbound peering issue. The only other reasons it would fall behind (afaik) are a consensus error, or sleep periods on the gossip configs that are too long:
I assume you lowered these after the previous p2p problems.
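For reference, the gossip sleep settings mentioned above are Tendermint consensus-config fields, which pocket-core surfaces through its config.json. Below is a minimal Go sketch using Tendermint's config package; the field names come from Tendermint, but the shortened values at the end are purely illustrative assumptions, not recommendations.

```go
package main

import (
	"fmt"
	"time"

	cfg "github.com/tendermint/tendermint/config"
)

func main() {
	// Start from Tendermint's defaults; pocket-core embeds the same
	// consensus settings in its config.json.
	c := cfg.DefaultConsensusConfig()

	// How long the consensus reactor sleeps between gossiping
	// consensus messages to peers. Lowering this makes gossip more aggressive.
	fmt.Println("peer_gossip_sleep_duration:", c.PeerGossipSleepDuration)

	// How long the reactor sleeps between querying peers for +2/3 majorities.
	fmt.Println("peer_query_maj23_sleep_duration:", c.PeerQueryMaj23SleepDuration)

	// Illustrative only (assumed values, not a recommendation): an operator
	// experimenting with shorter sleeps might set something like:
	c.PeerGossipSleepDuration = 10 * time.Millisecond
	c.PeerQueryMaj23SleepDuration = 200 * time.Millisecond
}
```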
Yes. You suggested removing two zeros... but... I don't follow directions well and I removed only one.
Here is a full, normal customer node with 50 peers, which had been functioning without issue for several days... it just stopped moving forward. ga02 config.json.txt
QQ: when you restarted it, did you reset the datadir with a backup, or just kill and restart?
My process is:
@BenVanGithub was it corrupted or
panic: enterPrecommit: +2/3 prevoted for an invalid block: wrong Block.Header.AppHash.
UPDATE: Going on 3 days without any nodes getting stuck.
Hey just wanted to ping here and say that we are still investigating potential causes, but this seems to be related to a corrupted datadir...
At your convenience, can you describe your node backup process?
Yes... but let's clarify...
Yeah, it's part 6) that I'm worried about. I'm assuming that the existing node is 'alive' when you take the backup... that's the same thing as an unclean shutdown, because those databases can be mid-write. I'm not saying that this is the reason for all of your corruption, but I do think it is a contributor. Common errors associated with unclean shutdown:
I hear you... but it doesn't add up... Also, I am aware that pulling live data can sometimes fail to get a good copy... but that's not the case here... step 7 is good.
@BenVanGithub I don't understand what this means:
To be clear, I'm suggesting that your 'backup process' (or bootstrap process, in your case) can cause problems. Just because step 7) works for 24+ hours doesn't mean nothing happened to the DB along the way; many blocks later, when that part of the database is accessed, there could be an apphash error. With that being said, I'm fully investigating relations to this issue: pokt-network/pocket-core#1230
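To illustrate the unclean-shutdown point in generic terms, here is a goleveldb sketch (for illustration only: the path is made up and this is not pocket-core's actual storage code). A key-value store that is still open and being written to can have data buffered or mid-write on disk, so copying its directory at that moment can capture a torn state that only fails when it is read back much later; closing the store, i.e. stopping the node cleanly, before copying avoids that.

```go
package main

import (
	"fmt"
	"log"

	"github.com/syndtr/goleveldb/leveldb"
)

func main() {
	// Hypothetical data directory, used only for this illustration.
	db, err := leveldb.OpenFile("example-data/blockstore.db", nil)
	if err != nil {
		log.Fatal(err)
	}

	// Writes like these may still be buffered or partially flushed.
	// A copy or snapshot of the directory taken at this moment can
	// capture a torn state that only fails when it is read back later.
	for i := 0; i < 1000; i++ {
		if err := db.Put([]byte(fmt.Sprintf("block/%d", i)), []byte("..."), nil); err != nil {
			log.Fatal(err)
		}
	}

	// Closing the store (i.e. shutting the node down cleanly) leaves a
	// consistent on-disk state; copying the directory AFTER this point
	// is the safe ordering.
	if err := db.Close(); err != nil {
		log.Fatal(err)
	}
}
```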
OK... I get what you are saying now.
Yeah, no problem at all, thanks for the patience. Will update on this issue.
As an update/bump on this issue, we are investigating the possibility of incompatibilities between Tendermint versions.
Update (Thursday): After about 2 days with a low failure rate, today was another high-failure (3/100) day, and nodes sometimes get stuck in batches on the same block. It would be very hard to explain a batch failure as corrupted data on individual nodes, even if the original copy for all of the nodes was the same. You could make a case for it, but it would be difficult.
Update: Confirmation from @BenVanGithub that after 7+ days with this change: pokt-network/pocket-core#1237
One idea I would like to propose to prevent the formation of subnetwork "cliques" of nodes that get stuck together: have the node rotate its connected peers by disconnecting from the oldest (or one random) peer each block, at least when the node has maxed out its connections. Once per block gives each node 6 chances to find another working node before getting jailed, which should greatly improve resilience to this problem.
On a healthy network 30/10 would stabilize at 10/10 actual usage. |
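A rough sketch of that rotation idea, assuming a hypothetical peer-set type (the names here are invented for illustration and are not pocket-core's or Tendermint's actual p2p API):

```go
package main

import (
	"fmt"
	"math/rand"
)

// peerSet is a toy stand-in for the node's connected-peer list; a real
// implementation would wrap the node's p2p switch. All names are invented.
type peerSet struct {
	peers    []string
	maxPeers int
}

// rotateOnNewBlock is meant to be called once per committed block.
// When the node has maxed out its connections, it evicts one random peer
// so a fresh dial can replace it, giving a stuck node repeated chances to
// escape a "clique" of peers that are all stuck on the same height.
func (ps *peerSet) rotateOnNewBlock() {
	if len(ps.peers) < ps.maxPeers {
		return // still room for new peers; nothing to evict
	}
	i := rand.Intn(len(ps.peers))
	fmt.Println("disconnecting peer to make room for a new dial:", ps.peers[i])
	ps.peers = append(ps.peers[:i], ps.peers[i+1:]...)
}

func main() {
	ps := &peerSet{peers: []string{"node-a", "node-b", "node-c"}, maxPeers: 3}
	ps.rotateOnNewBlock() // in a real node this would hook into the new-block event
	fmt.Println("remaining peers:", ps.peers)
}
```

Evicting only when at the connection cap keeps under-connected nodes from churning peers unnecessarily.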
Update:
You're saying that removing the session db turns a hard stuck into a soft stuck? How do soft stucks relate to ctrl+c?
Sorry for the delayed response. I don't have any good data on the ctrl+c issue.
I would like to bump this to RC-0.6.1 and higher, and only after the upgrade height, meaning we are seeking evidence of this issue on the latest releases post-upgrade.
Describe the bug
During a 24-hour period, approximately 3% of nodes will stop moving forward, with no other signs of distress.
To Reproduce
Steps to reproduce the behavior:
1.) Spin up 100 nodes.
2.) Get them all running and relaying.
3.) Watch them like a hawk.
Expected behavior
The node should follow the chain.
Additional context
pocket version
The symptom has been somewhat consistent for the last three days: about 3/100 nodes get stuck.
Stopping the client and restarting it succeeds "most of the time".
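As a small aid for reproducing this, here is a minimal sketch of "watching them like a hawk": poll each node's Tendermint RPC /status endpoint and warn when latest_block_height stops advancing. The RPC address and the polling interval are assumptions; adjust them for your deployment.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// statusResponse captures just the field we need from Tendermint's /status RPC.
type statusResponse struct {
	Result struct {
		SyncInfo struct {
			LatestBlockHeight string `json:"latest_block_height"`
		} `json:"sync_info"`
	} `json:"result"`
}

func latestHeight(rpcAddr string) (string, error) {
	resp, err := http.Get(rpcAddr + "/status")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	var s statusResponse
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		return "", err
	}
	return s.Result.SyncInfo.LatestBlockHeight, nil
}

func main() {
	// Assumed local Tendermint RPC endpoint; change to match your node.
	const rpcAddr = "http://127.0.0.1:26657"

	last := ""
	for {
		h, err := latestHeight(rpcAddr)
		if err != nil {
			fmt.Println("status check failed:", err)
		} else if h == last {
			// Height unchanged across two polls: the node may be stuck.
			fmt.Println("WARNING: height has not advanced, still at", h)
		} else {
			fmt.Println("height:", h)
			last = h
		}
		// Poll at an interval comfortably longer than the chain's block time
		// so a repeated height really means "stuck"; 20 minutes is an assumption.
		time.Sleep(20 * time.Minute)
	}
}
```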