error: "rtnetlink socket closed" #5428

Open
wjmelements opened this issue Mar 18, 2024 · 11 comments
Labels: blocked · bug (Something isn't working) · Networking

Comments

@wjmelements

Description

Lighthouse is not connecting to geth (v1.13.12) and floods the logs with error messages about the rtnetlink socket being closed.

Version

stable v5.1.1

Present Behaviour

The log contains only this error, repeated:

WARN Listener error                          error: Custom { kind: Other, error: Other(Right(Custom { kind: Other, error: Transport(Custom { kind: Other, error: Right(Io(Custom { kind: BrokenPipe, error: "rtnetlink socket closed" })) }) })) }, service: libp2p

It does not recover; Lighthouse just spams this log repeatedly.

Geth config:

--authrpc.addr localhost --authrpc.port 8551 --authrpc.jwtsecret /shared_ebs/geth_data/geth/jwtsecret

Lighthouse config:

--execution-endpoint http://localhost:8551 --execution-jwt /shared_ebs/geth_data/geth/jwtsecret

Expected Behaviour

Healthy node

Steps to resolve

Unknown

@michaelsproul (Member)

Try killing Lighthouse with kill -9 or similar, then restart it.

I think this is a bug in libp2p where the listener gets interrupted by a shutdown signal during startup and then goes into an infinite loop.

If Lighthouse doesn't restart after the kill there might be something else that triggered the initial shutdown, e.g. low disk space.

michaelsproul added the bug (Something isn't working) and blocked labels on Mar 21, 2024
@wjmelements (Author) commented Mar 21, 2024

"then restart it"

The issue happens again on restart. I am using systemd to manage the service, which sends SIGKILL if SIGINT doesn't stop it after long enough.

"low disk space"

The disk has 3 TB of free space.

I'm going to try to capture the logs the next time it starts happening, but it is challenging because the log flood eventually rotates those logs out.

@michaelsproul (Member)

If you could DM some debug logs from $datadir/beacon/logs, I'll have a look at why it might be shutting down.

@michaelsproul (Member)

I can't glean much from the debug logs; they just show Lighthouse eventually shutting down due to the SIGTERM and the issue being resolved after the restart:

Mar 18 20:28:26.757 INFO Shutting down.., reason: Success("Received SIGTERM"), module: lighthouse:696

Mar 18 20:28:42.133 INFO Lighthouse started, version: Lighthouse/v5.1.1-2a3c709, module: lighthouse:621

The debug logs filled up with warnings pretty quickly, so the 5 files only covered:

  • lots of Listener error logs starting Mar 18 20:27:20.336
  • shutdown due to SIGTERM: Mar 18 20:28:26.757
  • errors continuing after SIGTERM and prior to Lighthouse starting again
  • Lighthouse starts, without errors: Mar 18 20:28:42.133

So there's only ~1 minute of history covered by the logs.

@AgeManning It looks like we expect some listener errors:

SwarmEvent::ListenerError { error, .. } => {
    // this is non fatal, but we still check
    warn!(self.log, "Listener error"; "error" => ?error);
    if Swarm::listeners(&self.swarm).count() == 0 {
        Some(NetworkEvent::ZeroListeners)
    } else {
        None
    }
}

I guess we can't change that event handler to shut down immediately because some listener errors are non-fatal? E.g. listening on 0.0.0.0 causes LH to try listening on some random interface that doesn't work; if we have other interfaces, we can safely ignore this sort of failure.

@wjmelements (Author)

It happened again, but I don't have logs from when it started because it produces a million lines of error logs every 2 minutes, so my logs cycle out before I even notice it went down.

@michaelsproul (Member)

@wjmelements Do you want to try modifying Lighthouse so it just crashes on the lines I linked above? If you replace the current warning with panic!(), it will at least kill things quickly without spamming the logs, and we might be able to see what's going on a bit better.
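
For reference, a minimal sketch of that diagnostic change, i.e. replacing the warn! arm quoted above (diagnostic only, not a fix):

SwarmEvent::ListenerError { error, .. } => {
    // Diagnostic only: abort on the first listener error instead of warning
    // in a loop, so the original cause sits at the top of the logs.
    panic!("Listener error: {error:?}");
}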

@AgeManning (Member) commented Mar 24, 2024

Yeah. So the reason we tolerate the listen errors is that libp2p will generally try to listen on a number of transports, i.e.:

  • ip4 tcp
  • ip6 tcp
  • ip4 udp (quic)
  • ip6 udp (quic)

The host might not have ip6 support, or one of the tcp ports might fail to bind. Each individual listen error should not be a catastrophic failure, as we can fail on ip6 but succeed on ip4 (a common case); the sketch below illustrates this.
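
To make that concrete, here is a hedged sketch of what those independent listen attempts look like against rust-libp2p's Swarm API (the addresses and ports are illustrative, not Lighthouse's actual config):

// Assuming `swarm` is an already-constructed libp2p Swarm.
// Each listen_on call can fail independently; a single failure is fine
// as long as at least one listener remains.
for addr in [
    "/ip4/0.0.0.0/tcp/9000",         // ip4 tcp
    "/ip6/::/tcp/9000",              // ip6 tcp
    "/ip4/0.0.0.0/udp/9001/quic-v1", // ip4 udp (quic)
    "/ip6/::/udp/9001/quic-v1",      // ip6 udp (quic)
] {
    if let Err(e) = swarm.listen_on(addr.parse().expect("valid multiaddr")) {
        // Non-fatal: e.g. no ip6 support on this host; keep the others.
        eprintln!("failed to listen on {addr}: {e}");
    }
}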

I'll reach out to Michael and try to have a look at the logs. As of v5.1 we now store libp2p logs in the .lighthouse//beacon_node/logs directory. There might be useful information there.

@AgeManning (Member)

Also including @jxs, who may be interested in this.

@ackintosh (Member)

@wjmelements It might be helpful to check your system logs (e.g. /var/log/messages) for any errors related to network interfaces.

The rtnetlink socket closed error is emitted by if-watch (linux.rs), a component that watches network interface events, such as interfaces being brought up or down. rtnetlink serves as a mechanism for retrieving network information from the kernel, facilitated through a dedicated socket communication channel between the kernel and user programs. This error could indicate that an issue has arisen within that socket communication to the kernel.
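
For illustration, a minimal sketch of how a consumer subscribes to those events via if-watch (assuming the crate's tokio backend; the printed messages are placeholders):

use futures::StreamExt;
use if_watch::{tokio::IfWatcher, IfEvent};

#[tokio::main]
async fn main() -> std::io::Result<()> {
    // Opens an rtnetlink socket to the kernel on Linux.
    let mut watcher = IfWatcher::new()?;
    // The watcher is a Stream of io::Result<IfEvent>; the "rtnetlink socket
    // closed" error from this issue surfaces as the Err case here.
    while let Some(event) = watcher.next().await {
        match event {
            Ok(IfEvent::Up(net)) => println!("interface up: {net}"),
            Ok(IfEvent::Down(net)) => println!("interface down: {net}"),
            Err(e) => eprintln!("if-watch error: {e}"),
        }
    }
    Ok(())
}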

@wjmelements (Author)

I have logs from when it starts, on Holesky:

Apr 08 18:17:06.001 INFO Synced                                  slot: 1391485, block:    …  empty, epoch: 43483, finalized_epoch: 43481, finalized_root: 0x2b42…9fd0, exec_hash: 0x89e3…f847 (verified), peers: 19, service: slot_notifier
Apr 08 18:17:14.353 INFO New block received                      root: 0x09d30077b578042c8ce3a2f1a39e5a191544dcf25f1aa45dfd719f2b1aab0236, slot: 1391486
Apr 08 18:17:18.001 INFO Synced                                  slot: 1391486, block: 0x09d3…0236, epoch: 43483, finalized_epoch: 43481, finalized_root: 0x2b42…9fd0, exec_hash: 0x330b…c8c4 (verified), peers: 19, service: slot_notifier
Apr 08 18:17:30.004 INFO Synced                                  slot: 1391487, block:    …  empty, epoch: 43483, finalized_epoch: 43481, finalized_root: 0x2b42…9fd0, exec_hash: 0x330b…c8c4 (verified), peers: 19, service: slot_notifier
Apr 08 18:17:42.316 INFO Synced                                  slot: 1391488, block:    …  empty, epoch: 43484, finalized_epoch: 43482, finalized_root: 0x5c57…7c7d, exec_hash: 0x330b…c8c4 (verified), peers: 19, service: slot_notifier
Apr 08 18:17:53.266 WARN Snapshot cache miss for blob verification, index: 4, block_root: 0x08b5…65b1, service: beacon
Apr 08 18:17:53.348 WARN Snapshot cache miss for blob verification, index: 5, block_root: 0x08b5…65b1, service: beacon
Apr 08 18:17:53.733 WARN Snapshot cache miss for blob verification, index: 1, block_root: 0x08b5…65b1, service: beacon
Apr 08 18:17:53.823 WARN Snapshot cache miss for blob verification, index: 2, block_root: 0x08b5…65b1, service: beacon
Apr 08 18:17:54.000 INFO Synced                                  slot: 1391489, block:    …  empty, epoch: 43484, finalized_epoch: 43482, finalized_root: 0x5c57…7c7d, exec_hash: 0x330b…c8c4 (verified), peers: 19, service: slot_notifier
Apr 08 18:17:55.251 WARN Snapshot cache miss for blob verification, index: 0, block_root: 0x08b5…65b1, service: beacon
Apr 08 18:18:51.433 INFO New block received                      root: 0x08b5a63f9c883498a8fc00628bb5bd0390d2715cbe068ffeea9c26fdabf365b1, slot: 1391489
Apr 08 18:19:26.923 INFO Synced                                  slot: 1391497, block:    …  empty, epoch: 43484, finalized_epoch: 43482, finalized_root: 0x5c57…7c7d, exec_hash: 0x330b…c8c4 (verified), peers: 19, service: slot_notifier
Apr 08 18:19:26.928 WARN Execution engine call failed            error: HttpClient(url: http://localhost:6551/, kind: timeout, detail: operation timed out), service: exec
Apr 08 18:19:26.932 WARN Error whilst processing payload status  error: Api { error: HttpClient(url: http://localhost:6551/, kind: timeout, detail: operation timed out) }, service: exec
Apr 08 18:19:27.820 WARN Listener error                          error: Custom { kind: Other, error: Other(Right(Custom { kind: Other, error: Transport(Custom { kind: Other, error: Left(Other(Left(Left(Custom { kind: BrokenPipe, error: "rtnetlink socket closed" })))) }) })) }, service: libp2p
Apr 08 18:19:27.825 WARN Listener error                          error: Custom { kind: Other, error: Other(Right(Custom { kind: Other, error: Transport(Custom { kind: Other, error: Right(Io(Custom { kind: BrokenPipe, error: "rtnetlink socket closed" })) }) })) }, service: libp2p

after which the same log repeats, mostly Right with the occasional Left. Here a Left is shown between two Rights:

Apr 08 18:19:27.872 WARN Listener error                          error: Custom { kind: Other, error: Other(Right(Custom { kind: Other, error: Transport(Custom { kind: Other, error: Right(Io(Custom { kind: BrokenPipe, error: "rtnetlink socket closed" })) }) })) }, service: libp2p
Apr 08 18:19:27.924 WARN Listener error                          error: Custom { kind: Other, error: Other(Right(Custom { kind: Other, error: Transport(Custom { kind: Other, error: Left(Other(Left(Left(Custom { kind: BrokenPipe, error: "rtnetlink socket closed" })))) }) })) }, service: libp2p
Apr 08 18:19:27.934 WARN Listener error                          error: Custom { kind: Other, error: Other(Right(Custom { kind: Other, error: Transport(Custom { kind: Other, error: Right(Io(Custom { kind: BrokenPipe, error: "rtnetlink socket closed" })) }) })) }, service: libp2p

@ackintosh (Member) commented Apr 10, 2024

FYI: I have submitted a PR to if-watch to improve the rtnetlink socket closed error message by including specific reasons. This should help us understand why the error occurs.

libp2p/if-watch#39
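
Hypothetically, the shape of such a change is to attach the underlying cause rather than a fixed string (this snippet is illustrative, not the actual patch; `cause` is an invented parameter):

use std::io;

// Hypothetical helper: include the concrete cause in the io::Error instead
// of the bare "rtnetlink socket closed" string, so the log shows *why* the
// socket closed.
fn socket_closed_error(cause: &str) -> io::Error {
    io::Error::new(
        io::ErrorKind::BrokenPipe,
        format!("rtnetlink socket closed: {cause}"),
    )
}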
