
Otherwise "healthy" nodes getting stuck on a block #183

Open
BenVanGithub opened this issue Jan 27, 2021 · 33 comments

Comments

@BenVanGithub

Describe the bug
During a 24-hour period, approximately 3% of nodes will stop moving forward, with no other signs of distress.

To Reproduce
Steps to reproduce the behavior:
1.) Spin up 100 nodes.
2.) Get them all running and relaying.
3.) Watch them like a hawk.

Expected behavior
Nodes keep following the chain.

Additional context

  • OS: Ubuntu 20.02 - Digital Ocean VPS
  • pocket version: AppVersion: RC-0.5.2.9
  • "n_peers": "50" (which is the correct max number; 25 in, 25 out in config.json)
  • "height/round/step": "16970/0/4"

The symptom has been fairly consistent for the last three days: about 3 out of 100 nodes get stuck.
Stopping the client and restarting it succeeds "most of the time".
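
For reference, a minimal sketch of how the values above can be pulled from a node's Tendermint RPC; the endpoint names are standard Tendermint, but the locally exposed port 26657 is an assumption:

```bash
curl -s http://localhost:26657/net_info | grep '"n_peers"'                    # peer count
curl -s http://localhost:26657/consensus_state | grep '"height/round/step"'   # consensus position
curl -s http://localhost:26657/status | grep '"latest_block_height"'          # chain height
```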

@BenVanGithub
Author

BenVanGithub commented Jan 28, 2021

issue_jx11_consensus.txt
issue_jx11_netinfo.txt
issue_jx11_status.txt
issue_jx11_config.json.txt

This one (jx-11) only has 25 peers, I don't know why... still, 25 is 24 more than necessary to run. It just stopped moving forward at 17008 and got jailed a bit later.

@novy4

novy4 commented Jan 28, 2021

I'm seeing the same with one of our nodes: stuck on 17068 with 48 peers.

@andrewnguyen22
Contributor

@novy4 what's the ratio of the 48 inbound vs outbound?

@andrewnguyen22
Contributor

andrewnguyen22 commented Jan 28, 2021

@BenVanGithub Looks like you have no inbound peers on jx-11... Wondering if this is a different node than the original issue?

"n_peers": "50", (which is correct max number. 25 in 25 out in config.json)...

Could you possibly provide all of the proper debugging docs, as you did with jx-11?

@BenVanGithub
Author

Yes. It is a different node. The original one was a customer node, and I restarted it before I could grab the info that you requested. Next time I'll know what to grab before restarting it.

@BenVanGithub
Author

The inbound peer issue is new and may or may not be related. It seems to happen when I clone a new node from an old one. I'm doing some tests on my side before opening an official issue... it might be a process error in how I clone.

@andrewnguyen22
Contributor

@BenVanGithub thanks for this. It can be an array of things, but often when a node falls behind it seems to be the inbound peering issue. The only other reason it would fall behind (afaik) is a consensus error, or sleep periods that are too long on the gossip configs:

"PeerGossipSleepDuration": 100000000000,
"PeerQueryMaj23SleepDuration": 200000000000

I assume you lowered these after the previous p2p problems
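
For context, a rough sketch of what those integers work out to, assuming they are parsed as Go time.Duration values in nanoseconds (as Tendermint-style configs store durations); the arithmetic below is just illustration:

```bash
# Convert the nanosecond config values above to seconds.
echo "$((100000000000 / 1000000000)) s  PeerGossipSleepDuration as posted"      # 100 s
echo "$((200000000000 / 1000000000)) s  PeerQueryMaj23SleepDuration as posted"  # 200 s
echo "$((10000000000  / 1000000000)) s  gossip sleep with one zero removed"     # 10 s
echo "$((1000000000   / 1000000000)) s  gossip sleep with two zeros removed"    # 1 s
```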

@BenVanGithub
Author

Yes. You suggested removing two zeros... but I don't follow directions well and I removed only one.
I made this change on every node I control.

@BenVanGithub
Author

Here is a full, normal customer node with 50 peers, which had been functioning without issue for several days... it just stopped moving forward.

ga02 config.json.txt
ga02 error on restart.txt
ga02 consensusstate.txt
ga02 netinfo.txt
ga02 status.txt
I had to kill -9 and got an error on restart (also attached).

@andrewnguyen22
Contributor

QQ: when you restarted it, did you reset the datadir from a backup, or just kill and restart?

@BenVanGithub
Author

My process is:
try a soft kill, wait about 30 seconds, do a hard kill if needed;
try a plain ole pocket start, and do a data remove-and-replace if the start fails (see the sketch below).
Note: I think this is the first time I have had a corrupted database on restart since 0.5.2.9.
Also: the rate of roughly 3 per 100 per 24 hours has held, and it's not the same machines over and over. I will start keeping a list of machines and stuck events just to confirm that.
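
For what it's worth, a minimal sketch of that restart procedure as a script, assuming the process is simply named pocket and using the ~30 second grace period described above (illustrative only, not a recommended recovery tool):

```bash
pkill pocket || true        # soft kill (SIGTERM)
sleep 30
pkill -9 pocket || true     # hard kill only if it is still running
pocket start                # plain restart; fall back to a data
                            # remove-and-replace only if this fails
```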

@andrewnguyen22
Contributor

@BenVanGithub was it corrupted, or was it "Validator set not found for height x"?

@BenVanGithub
Author

panic: enterPrecommit: +2/3 prevoted for an invalid block: wrong Block.Header.AppHash.
https://github.com/pokt-network/pocket-core-deployments/files/5891380/ga02.error.on.restart.txt

@BenVanGithub
Author

UPDATE: Going on 3 days without any nodes getting stuck.

@andrewnguyen22
Contributor

Hey just wanted to ping here and say that we are still investigating potential causes, but this seems to be related to a corrupted datadir...

panic: enterPrecommit: +2/3 prevoted for an invalid block: wrong Block.Header.AppHash.
https://github.com/pokt-network/pocket-core-deployments/files/5891380/ga02.error.on.restart.txt

At your convenience, can you describe your node backup process?

@BenVanGithub
Author

Yes.. but let's clarify:
1.) node stops moving forward
2.) node won't respond to a soft kill
3.) kill node with -9
4.) restart with "pocket start"
5.) get the error shown (1 time out of 20)
6.) reload from an existing node (no backup involved, just grab a good node's data)
7.) all is well

@andrewnguyen22
Contributor

Yeah, it's part 6) that I'm worried about. I'm assuming that existing node is 'alive' when you take the backup... that's the same thing as an unclean shutdown, because those databases can be mid-write. I'm not saying that this is the reason for all of your corruption, but I do think it is a contributor. Common errors associated with unclean shutdown:

validator set not found at height:
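
To illustrate the concern, here is a hedged sketch of a cleaner version of step 6), where the donor node is stopped before its data directory is copied; the ~/.pocket/data path, host name, and rsync usage are all assumptions, not the project's documented procedure:

```bash
# On the donor ("good") node: stop cleanly so its databases are not mid-write.
pkill pocket && sleep 30
# Copy the data directory over to the stuck node (paths/hosts are placeholders).
rsync -a ~/.pocket/data/ user@stuck-node:~/.pocket/data/
# Restart the donor once the copy has finished.
pocket start
```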

@BenVanGithub
Author

BenVanGithub commented Feb 1, 2021

I hear you.. but it doesn't add up...
You're saying that fixing node "A" is what is causing node "B" to break. But node "B" is not broken.
Surely hard killing the already-malfunctioning node "A" is a much more likely cause of its own corruption.

Also... I am aware that pulling live data can sometimes fail to get a good copy... but that's not the case here.. step 7 is good.

@andrewnguyen22
Contributor

@BenVanGithub I don't understand what this means:

You're saying that fixing node "A" is what is causing node "B" to break. But node "B" is not broken.

To be clear, I'm suggesting that your 'backup process' (or bootstrap process, in your case) can cause problems. Just because step 7) works for 24+ hours doesn't mean nothing happened to the DB along the way; many blocks later, when that part of the database is accessed, there could be an apphash error.

With that being said, I'm fully investigating relations to this issue: pokt-network/pocket-core#1230
And also other potential problems with the RC-0.5.2.9 release causing intermittent consensus failures

@BenVanGithub
Author

OK.. I get what you are saying now.
Sorry.. I'm not just slow... I'm also opinionated and defensive sometimes.
Yes.. quite a few nodes required re-bootstrap during the 5.2.9 upgrade; the symptoms are consistent with some percentage of them having bad data copies that didn't fail right away.
The current lack of failures also supports that explanation.
Thank you.

@andrewnguyen22
Contributor

Yeah no problem at all, thanks for the patience. Will update on this issue

@BenVanGithub
Author

[Screenshot from 2021-02-06 18-28-24]
Started watching this one a full hour before it stopped moving... It was showing high CPU, which is not common on 0.5.2.9.
Kill and restart: all good, no corruption.

@Garandor

Garandor commented Feb 8, 2021

[screenshot]

Can corroborate. Full peers (20 in, 20 out), high CPU and disk I/O before the last block - and the profile looks different from normal blocks (CPU and disk take longer than usual on the last block production at 22:28).
[screenshot]

Interestingly, the last block was produced at 22:28; at the next block I get
enterPrevote: ProposalBlock is nil

and then the consensus module does not log any more messages.
[screenshot]

@andrewnguyen22
Contributor

As an update/bump on this issue, we are investigating the possibility of incompatibilities between Tendermint versions

@BenVanGithub
Author

BenVanGithub commented Feb 12, 2021

Update, Thursday: after about 2 days of a low failure rate, today was another high-failure (3/100) day, and the nodes sometimes get stuck in batches on the same block. It would be very hard to explain a batch failure as a node with corrupted data, even if the original copy for all of the nodes was the same. You could make a case for it, but it would be difficult.
Update, Friday: block #18378 stopped 2% of nodes; none of them showed resource spikes, they all just "missed" the next block and got stuck.

@andrewnguyen22
Contributor

Update: confirmation from @BenVanGithub that after 7+ days with this change (pokt-network/pocket-core#1237),
only 2 out of 600+ nodes got stuck (possibly for unrelated reasons).

@Garandor

Garandor commented Mar 2, 2021

UPDATE: Nevermind, the node got stuck again today

Anecdotal datapoint: Changing IN/OUT to 30/10 did not prevent my node from getting stuck about once every day (yellow bar visible on hang in screenshot)

Deleting the node's address book and letting it rebuild seems to have done the trick, though. It has been running since 2/27 with no hangs (using the 30/10 node configuration).
[screenshot]
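
For anyone trying the same thing, a rough sketch of that address-book reset; the addrbook.json location under ~/.pocket/config is an assumption and may differ per install:

```bash
pkill pocket && sleep 30            # stop the node cleanly first
rm ~/.pocket/config/addrbook.json   # drop the peer address book
pocket start                        # the book is rebuilt from seeds and peer exchange
```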

@Garandor

Garandor commented Mar 2, 2021

One idea I would like to propose to prevent the formation of subnetwork "cliques" of nodes that get stuck together:

Have the node rotate connected nodes by disconnecting from the oldest or one random node each block (at least when the node has maxed out connections).
That way it would reconnect to a random new node from the address book, with a chance that this node will not be stuck.
If I understand the communication correctly, only one node of each clique needs to find a healthy node for the whole clique to become "unstuck".

Once per block gives each node 6 chances to find another working node before getting jailed, which should greatly improve resilience to this problem.

@BenVanGithub
Author

On a healthy network 30/10 would stabilize at 10/10 actual usage.
There may be several issues at play here but the two big factors are obvious to me.
1.) The network is massively OVER connected and drowning in its own gossip.
2.) Some nodes are locking up HUGE numbers of incoming slots and not providing equal numbers of good-peer slots in return. The Pocket seed nodes are the easiest bad actors to identify, but I'm building a database to isolate who/what the other culprits are:
This node has been up for 3 days, and 9 out of 30 available inbound slots are used up by these silly seed nodes! Just shut them down! Nodes only need seeds once in their entire life span, and even then they only need one.
[Screenshot from 2021-03-02 19-37-09]
Also: @Garandor Yes. A random node drop would be healthy, but it would have to also remove that node from the address book or it will just reconnect.

@BenVanGithub
Author

Update:
There appear to be two kinds of "stuck":
1.) Soft stuck -- when killed and restarted, the node will move forward again [approx. 95%]
2.) Hard stuck -- when killed and restarted, the node will NOT move forward [5%]
Leaving a note here that removing session.db will allow a hard-stuck node to move forward again without having to do a full reset.
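
A rough sketch of that session.db reset; the database location under ~/.pocket/data is an assumption (and as a LevelDB store it is a directory, hence the recursive remove):

```bash
pkill -9 pocket || true            # the node is hard stuck, so a hard kill
rm -rf ~/.pocket/data/session.db   # remove only the session database
pocket start                       # restart without a full data reset
```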

@andrewnguyen22
Contributor

andrewnguyen22 commented Mar 22, 2021

Update:
There appear to be two kinds of "stuck":
1.) Soft stuck -- when killed and restarted, the node will move forward again [approx. 95%]
2.) Hard stuck -- when killed and restarted, the node will NOT move forward [5%]
Leaving a note here that removing session.db will allow a hard-stuck node to move forward again without having to do a full reset.

You're saying that removing the session db turns a hard stuck into a soft stuck?

How do soft stucks relate to ctrl+c?

@BenVanGithub
Author

Sorry for the delayed response.
Yes. Removing session.db seems to turn a hard stuck into a soft stuck.

I don't have any good data on the ctrl+c issue.

@andrewnguyen22
Contributor

I would like to bump this to RC-0.6.1 and higher, only after the upgrade height, meaning we are seeking evidence of this issue on the latest releases POST-upgrade.
