
Otherwise "healthy" nodes getting stuck on a block #183

Open
BenVanGithub opened this issue Jan 27, 2021 · 33 comments

Comments

@BenVanGithub

Describe the bug
During a 24-hour period, approximately 3% of nodes will stop moving forward, with no other signs of distress.

To Reproduce
Steps to reproduce the behavior:
1.) Spin up 100 nodes.
2.) Get them all running and relaying.
3.) Watch them like a hawk.

Expected behavior
Nodes keep following the chain.

Additional context

  • OS: Ubuntu 20.02 - Digital Ocean VPS
  • pocket version: AppVersion: RC-0.5.2.9
  • "n_peers": "50" (which is the correct max number; 25 in, 25 out in config.json)
  • "height/round/step": "16970/0/4"

The symptom has been fairly consistent for the last three days: about 3 out of 100 nodes get stuck.
Stopping the client and restarting it succeeds "most of the time".
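
For reference, a minimal sketch of how the values above can be pulled from a node's Tendermint RPC; the endpoint names are standard Tendermint, but the locally exposed port 26657 is an assumption:

```bash
curl -s http://localhost:26657/net_info | grep '"n_peers"'                    # peer count
curl -s http://localhost:26657/consensus_state | grep '"height/round/step"'   # consensus position
curl -s http://localhost:26657/status | grep '"latest_block_height"'          # chain height
```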

@BenVanGithub
Author

BenVanGithub commented Jan 28, 2021

issue_jx11_consensus.txt
issue_jx11_netinfo.txt
issue_jx11_status.txt
issue_jx11_config.json.txt

This one (jx-11) only has 25 peers, I don't know why... still, 25 is 24 more than necessary to run. It just stopped moving forward at 17008 and got jailed a bit later.

@novy4

novy4 commented Jan 28, 2021

I'm seeing the same with one of our nodes: stuck on 17068 with 48 peers.

@andrewnguyen22
Contributor

@novy4 what's the ratio of the 48 inbound vs outbound?

@andrewnguyen22
Contributor

andrewnguyen22 commented Jan 28, 2021

@BenVanGithub Looks like you have no inbound peers on jx-11... Wondering if this is a different node than the original issue?

"n_peers": "50", (which is correct max number. 25 in 25 out in config.json)...

Could you possibly provide all of the proper debugging docs, as you did with jx-11?

@BenVanGithub
Author

Yes. It is a different node. The original one was a customer node, and I restarted it before I could grab the info that you requested. Next time I'll know what to grab before restarting it.

@BenVanGithub
Author

The inbound peer issue is new and may or may not be related. It seems to happen when I clone a new node from an old one. I'm doing some tests on my side before opening an official issue... it might be a process error in how I clone.

@andrewnguyen22
Contributor

@BenVanGithub thanks for this. It can be an array of things, but often when a node falls behind it seems to be the inbound peering issue. The only other reason it would fall behind (afaik) is a consensus error, or sleep periods that are too long on the gossip configs:

"PeerGossipSleepDuration": 100000000000,
"PeerQueryMaj23SleepDuration": 200000000000

I assume you lowered these after the previous p2p problems
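
For context, a rough sketch of what those integers work out to, assuming they are parsed as Go time.Duration values in nanoseconds (as Tendermint-style configs store durations); the arithmetic below is just illustration:

```bash
# Convert the nanosecond config values above to seconds.
echo "$((100000000000 / 1000000000)) s  PeerGossipSleepDuration as posted"      # 100 s
echo "$((200000000000 / 1000000000)) s  PeerQueryMaj23SleepDuration as posted"  # 200 s
echo "$((10000000000  / 1000000000)) s  gossip sleep with one zero removed"     # 10 s
echo "$((1000000000   / 1000000000)) s  gossip sleep with two zeros removed"    # 1 s
```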

@BenVanGithub
Author

Yes. You suggested removing two zeros... but I don't follow directions well and I removed only one.
I made this change on every node I control.

@BenVanGithub
Author

Here is a full, normal customer node with 50 peers, which had been functioning without issue for several days... it just stopped moving forward.

ga02 config.json.txt
ga02 error on restart.txt
ga02 consensusstate.txt
ga02 netinfo.txt
ga02 status.txt
I had to kill -9 and got an error on restart (also attached).

@andrewnguyen22
Contributor

QQ: when you restarted it, did you reset the datadir from a backup, or just kill and restart?

@BenVanGithub
Author

My process is:
try a soft kill, wait about 30 seconds, do a hard kill if needed;
try a plain ole pocket start, and do a data remove-and-replace if the start fails (see the sketch below).
Note: I think this is the first time I have had a corrupted database on restart since 0.5.2.9.
Also: the rate of roughly 3 per 100 per 24 hours has held, and it's not the same machines over and over. I will start keeping a list of machines and stuck events just to confirm that.
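
For what it's worth, a minimal sketch of that restart procedure as a script, assuming the process is simply named pocket and using the ~30 second grace period described above (illustrative only, not a recommended recovery tool):

```bash
pkill pocket || true        # soft kill (SIGTERM)
sleep 30
pkill -9 pocket || true     # hard kill only if it is still running
pocket start                # plain restart; fall back to a data
                            # remove-and-replace only if this fails
```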

@andrewnguyen22
Contributor

@BenVanGithub was it corrupted, or was it "Validator set not found for height x"?

@BenVanGithub
Author

panic: enterPrecommit: +2/3 prevoted for an invalid block: wrong Block.Header.AppHash.
https://github.com/pokt-network/pocket-core-deployments/files/5891380/ga02.error.on.restart.txt

@BenVanGithub
Author

UPDATE: Going on 3 days without any nodes getting stuck.

@andrewnguyen22
Contributor

Hey just wanted to ping here and say that we are still investigating potential causes, but this seems to be related to a corrupted datadir...

panic: enterPrecommit: +2/3 prevoted for an invalid block: wrong Block.Header.AppHash.
https://github.com/pokt-network/pocket-core-deployments/files/5891380/ga02.error.on.restart.txt

At your convenience, can you describe your node backup process?

@BenVanGithub
Author

Yes.. but let's clarify:
1.) node stops moving forward
2.) node won't respond to a soft kill
3.) kill node with -9
4.) restart with "pocket start"
5.) get the error shown (1 time out of 20)
6.) reload from an existing node (no backup involved, just grab a good node's data)
7.) all is well

@andrewnguyen22
Contributor

Yeah, it's part 6) that I'm worried about. I'm assuming that existing node is 'alive' when you take the backup... that's the same thing as an unclean shutdown, because those databases can be mid-write. I'm not saying that this is the reason for all of your corruption, but I do think it is a contributor. Common errors associated with unclean shutdown:

validator set not found at height:
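
To illustrate the concern, here is a hedged sketch of a cleaner version of step 6), where the donor node is stopped before its data directory is copied; the ~/.pocket/data path, host name, and rsync usage are all assumptions, not the project's documented procedure:

```bash
# On the donor ("good") node: stop cleanly so its databases are not mid-write.
pkill pocket && sleep 30
# Copy the data directory over to the stuck node (paths/hosts are placeholders).
rsync -a ~/.pocket/data/ user@stuck-node:~/.pocket/data/
# Restart the donor once the copy has finished.
pocket start
```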

@BenVanGithub
Author

BenVanGithub commented Feb 1, 2021

I hear you.. but it doesn't add up...
You're saying that fixing node "A" is what is causing node "B" to break. But node "B" is not broken.
Surely hard killing the already-malfunctioning node "A" is a much more likely cause of its own corruption.

Also... I am aware that pulling live data can sometimes fail to get a good copy... but that's not the case here.. step 7 is good.

@andrewnguyen22
Contributor

@BenVanGithub I don't understand what this means:

You're saying that fixing node "A" is what is causing node "B" to break. But node "B" is not broken.

To be clear, I'm suggesting that your 'backup process' (or bootstrap process, in your case) can cause problems. Just because step 7) works for 24+ hours doesn't mean nothing happened to the DB along the way; many blocks later, when that part of the database is accessed, there could be an apphash error.

With that being said, I'm fully investigating relations to this issue: pokt-network/pocket-core#1230
And also other potential problems with the RC-0.5.2.9 release causing intermittent consensus failures

@BenVanGithub
Author

OK.. I get what you are saying now.
Sorry.. I'm not just slow... I'm also opinionated and defensive sometimes.
Yes.. quite a few nodes required re-bootstrap during the 5.2.9 upgrade; the symptoms are consistent with some percentage of them having bad data copies that didn't fail right away.
The current lack of failures also supports that explanation.
Thank you.

@andrewnguyen22
Contributor

Yeah no problem at all, thanks for the patience. Will update on this issue

@BenVanGithub
Author

[Screenshot from 2021-02-06 18-28-24]
Started watching this one a full hour before it stopped moving... It was showing high CPU, which is not common on 0.5.2.9.
Kill and restart: all good, no corruption.

@Garandor

Garandor commented Feb 8, 2021

[screenshot]

Can corroborate. Full peers (20 in, 20 out), high CPU and disk I/O before the last block - and the profile looks different from normal blocks (CPU and disk take longer than usual on the last block production at 22:28).
[screenshot]

Interestingly, the last block was produced at 22:28; at the next block I get
enterPrevote: ProposalBlock is nil

and then the consensus module does not log any more messages.
[screenshot]

@andrewnguyen22
Contributor

As an update/bump on this issue, we are investigating the possibility of incompatibilities between Tendermint versions

@BenVanGithub
Author

BenVanGithub commented Feb 12, 2021

Update, Thursday: after about 2 days of a low failure rate, today was another high-failure (3/100) day, and the nodes sometimes get stuck in batches on the same block. It would be very hard to explain a batch failure as a node with corrupted data, even if the original copy for all of the nodes was the same. You could make a case for it, but it would be difficult.
Update, Friday: block #18378 stopped 2% of nodes; none of them showed resource spikes, they all just "missed" the next block and got stuck.

@andrewnguyen22
Contributor

Update: confirmation from @BenVanGithub that after 7+ days with this change (pokt-network/pocket-core#1237),
only 2 out of 600+ nodes got stuck (possibly for unrelated reasons).

@Garandor

Garandor commented Mar 2, 2021

UPDATE: Nevermind, the node got stuck again today

Anecdotal datapoint: Changing IN/OUT to 30/10 did not prevent my node from getting stuck about once every day (yellow bar visible on hang in screenshot)

Deleting the node's address book and letting it rebuild seems to have done the trick, though. It has been running since 2/27 with no hangs (using the 30/10 node configuration).
[screenshot]
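
For anyone trying the same thing, a rough sketch of that address-book reset; the addrbook.json location under ~/.pocket/config is an assumption and may differ per install:

```bash
pkill pocket && sleep 30            # stop the node cleanly first
rm ~/.pocket/config/addrbook.json   # drop the peer address book
pocket start                        # the book is rebuilt from seeds and peer exchange
```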

@Garandor

Garandor commented Mar 2, 2021

One idea I would like to propose to prevent the formation of subnetwork "cliques" of nodes that get stuck together:

Have the node rotate connected nodes by disconnecting from the oldest or one random node each block (at least when the node has maxed out connections).
That way it would reconnect to a random new node from the address book, with a chance that this node will not be stuck.
If I understand the communication correctly, only one node of each clique needs to find a healthy node for the whole clique to become "unstuck".

Once per block gives each node 6 chances to find another working node before getting jailed, which should greatly improve resilience to this problem.

@BenVanGithub
Author

On a healthy network 30/10 would stabilize at 10/10 actual usage.
There may be several issues at play here but the two big factors are obvious to me.
1.) The network is massively OVER connected and drowning in its own gossip.
2.) Some nodes are locking up HUGE numbers of incoming slots and not providing equal numbers of good-peer slots in return. The Pocket seed nodes are the easiest bad actors to identify, but I'm building a database to isolate who/what the other culprits are:
This node has been up for 3 days, and 9 out of 30 available inbound slots are used up by these silly seed nodes! Just shut them down! Nodes only need seeds once in their entire life span, and even then they only need one.
[Screenshot from 2021-03-02 19-37-09]
Also: @Garandor Yes. A random node drop would be healthy, but it would have to also remove that node from the address book or it will just reconnect.

@BenVanGithub
Author

Update:
There appear to be two kinds of "stuck":
1.) Soft stuck -- when killed and restarted, the node will move forward again [approx. 95%]
2.) Hard stuck -- when killed and restarted, the node will NOT move forward [5%]
Leaving a note here that removing session.db will allow a hard-stuck node to move forward again without having to do a full reset.
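
A rough sketch of that session.db reset; the database location under ~/.pocket/data is an assumption (and as a LevelDB store it is a directory, hence the recursive remove):

```bash
pkill -9 pocket || true            # the node is hard stuck, so a hard kill
rm -rf ~/.pocket/data/session.db   # remove only the session database
pocket start                       # restart without a full data reset
```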

@andrewnguyen22
Contributor

andrewnguyen22 commented Mar 22, 2021

Update:
There appear to be two kinds of "stuck":
1.) Soft stuck -- when killed and restarted, the node will move forward again [approx. 95%]
2.) Hard stuck -- when killed and restarted, the node will NOT move forward [5%]
Leaving a note here that removing session.db will allow a hard-stuck node to move forward again without having to do a full reset.

You're saying that removing the session db turns a hard stuck into a soft stuck?

How do soft stucks relate to ctrl+c?

@BenVanGithub
Author

Sorry for the delayed response.
Yes. Removing session.db seems to turn a hard stuck into a soft stuck.

I don't have any good data on the ctrl+c issue.

@andrewnguyen22
Contributor

I would like to bump this to RC-0.6.1 and higher, only after the upgrade height, meaning we are seeking evidence of this issue on the latest releases POST-upgrade.
