Issues with identify: identify failed to complete #2983

vyzo · 2024-09-27T12:56:42Z

The problem: when I disconnect and reconnect to a peer, the latter finds it impossible to open a stream, with an error indicating that identify failed to complete. This happens consistently in our node, and I can trigger it reliably, so there is some bug related to identify.

Relevant logs:

2024-09-27T12:48:24.680Z	DEBUG	basichost	basic/basic_host.go:483	negotiated: /ipfs/id/push/1.0.0 (took 27.52µs)
2024-09-27T12:48:24.680Z	DEBUG	net/identify	identify/id.go:543	/ipfs/id/push/1.0.0 received message from 12D3KooWEuVEZ26PwP7f8mYxuQ97ChiqLvXL5JeDfmXmqN4ds3YT /ip4/45.139.212.29/udp/17846/quic-v1
2024-09-27T12:48:30.475Z	DEBUG	basichost	basic/basic_host.go:483	negotiated: actor/root/messages/0.0.1 (took 48.66µs)
2024-09-27T12:48:30.479Z	DEBUG	actor	actor/dispatch.go:205	dispatching message from {BAAREICLTJJ63SCP5MUFVGX7GUM42QL2I5E46I355CASAEDIL55VX4SHTA====== did:key:z6MkjYP7Q8m862LdqNtExNMcH1BzgAkYTVp6QC5gHesktKbu {12D3KooWEuVEZ26PwP7f8mYxuQ97ChiqLvXL5JeDfmXmqN4ds3YT root}} to /public/hello
2024-09-27T12:48:30.479Z	DEBUG	node	node/public.go:59	hello from 12D3KooWEuVEZ26PwP7f8mYxuQ97ChiqLvXL5JeDfmXmqN4ds3YT
2024-09-27T12:48:30.479Z	DEBUG	swarm2	swarm/swarm.go:475	[12D3KooWFcvZkrs1LTQCdGEVPYi7diCL76Z9f1LcgcMMspujKWTj] opening stream to peer [12D3KooWEuVEZ26PwP7f8mYxuQ97ChiqLvXL5JeDfmXmqN4ds3YT]
2024-09-27T12:48:33.450Z	WARN	libp2p	libp2p/libp2p.go:380	send: failed to open stream to peer 12D3KooWEuVEZ26PwP7f8mYxuQ97ChiqLvXL5JeDfmXmqN4ds3YT: identify failed to complete: context deadline exceeded

Version Information

	github.com/libp2p/go-libp2p v0.36.4


vyzo · 2024-09-27T13:55:40Z

I have worked around it by manually making streams and skipping the identify wait, but there certainly seems to be a bug in identify here.

sukunrt · 2024-10-07T13:52:38Z

Is there any more info you can provide? A trivial connect -> disconnect -> connect test doesn't reproduce this for me.

when I disconnect and reconnect to a peer, the latter finds it impossible to open a stream

Do you mean when you open a stream to the peer after reconnecting, the error is identify failed to complete?

I see DEBUG log about NewStream failing with id failed to complete
Do you get any DEBUG logs from the identify package which are relevant to the peer?

vyzo · 2024-10-07T13:59:24Z

I am trying to understand the problem better; hopefully I can get you a good log package to diagnose this.

But yes, we fail to open streams because identify fails to complete.

Wondertan · 2024-10-07T14:12:39Z

We observed similar behavior when event bus subscriptions were not read fast enough on our side. A client connects and initiates identify; a server processes a new connection in the swarm and blocks, never reaching start and thus never processing identify streams. JFYI
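
To illustrate the stall described above, here is a minimal standard-library model, not go-libp2p code. It assumes the emitter delivers events with a blocking send to each subscriber's buffered channel; `emitBlocking` and `stallTimeout` are hypothetical names, and the timeout exists only so the model returns instead of hanging the way a truly blocked emitter would.

```go
package main

import (
	"fmt"
	"time"
)

// stallTimeout is how long the model waits on one send before declaring a stall.
// A real blocking emitter would simply wait forever at this point.
const stallTimeout = 50 * time.Millisecond

// emitBlocking delivers events to sub one at a time and returns how many were
// accepted before a send stayed blocked for longer than stallTimeout.
func emitBlocking(sub chan<- string, events []string) int {
	for i, ev := range events {
		select {
		case sub <- ev:
		case <-time.After(stallTimeout):
			return i // everything after this event is stuck behind it
		}
	}
	return len(events)
}

func main() {
	events := []string{"connected", "identify-started", "identify-completed"}

	// A subscriber with a buffer of 1 that never reads: later events stall.
	stalled := make(chan string, 1)
	fmt.Println("unread subscription delivered:", emitBlocking(stalled, events))

	// The same subscription, drained promptly by a dedicated goroutine.
	drained := make(chan string, 1)
	done := make(chan struct{})
	go func() {
		defer close(done)
		for range drained {
		}
	}()
	fmt.Println("drained subscription delivered:", emitBlocking(drained, events))
	close(drained)
	<-done
}
```

In this model the fix on the consumer side is to read the subscription in its own goroutine and hand slow work off elsewhere, so the emitter is never held up.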

MarcoPolo · 2024-10-09T00:16:47Z

when I disconnect and reconnect to a peer, the latter finds it impossible to open a stream

I'm not sure I parse this sentence correctly. I'm understanding it as meaning:

Peer A is connected to peer B.
Peer A disconnects from peer B, then reconnects to peer B.
Peer B attempts to open a stream to peer A.
Peer B fails to open a stream.

Is that understanding correct? Or is it instead:

Peer A is connected to peer B.
Peer A disconnects from peer B, then reconnects to peer B.
Peer A attempts to open a stream to peer B.
Peer A fails to open a stream.

The issue kind of sounds like we aren't picking the best conn, and then we try to identify on it.

vyzo · 2024-10-09T06:42:52Z

It is the first scenario.

MarcoPolo · 2024-10-09T16:24:21Z

ah okay. That makes sense. The issue is probably that Peer B doesn't realize that the "best connection" it picked is actually closed/disconnected. So it times out on waiting for that connection to Identify (it never will).

We can be smarter here and interrupt with an even better connection if a new one appears.

vyzo · 2024-10-09T16:26:40Z

Maybe we should have a collective identify completion channel for the peer, and not one for each conn.
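
A rough sketch of what a per-peer completion channel could look like, using only the standard library. The type `peerIdentify` and its methods are hypothetical and are not part of go-libp2p; the point is only that a waiter unblocks as soon as identify completes on any connection to the peer, rather than on one specific connection.

```go
package main

import (
	"fmt"
	"sync"
)

// peerIdentify tracks identify completion for one peer across all of its
// connections. Hypothetical type for illustration only.
type peerIdentify struct {
	once sync.Once
	done chan struct{}
}

func newPeerIdentify() *peerIdentify {
	return &peerIdentify{done: make(chan struct{})}
}

// markIdentified is called when identify completes on any connection.
// It is safe to call more than once and from multiple goroutines.
func (p *peerIdentify) markIdentified() {
	p.once.Do(func() { close(p.done) })
}

// Done returns a channel that is closed once any connection has identified.
func (p *peerIdentify) Done() <-chan struct{} { return p.done }

// identified reports, without blocking, whether identify has completed.
func (p *peerIdentify) identified() bool {
	select {
	case <-p.done:
		return true
	default:
		return false
	}
}

func main() {
	p := newPeerIdentify()
	fmt.Println("identified before:", p.identified())

	// The stale connection never identifies; the new connection does.
	p.markIdentified()
	p.markIdentified() // a second connection completing is harmless

	<-p.Done() // a stream opener waiting here is released
	fmt.Println("identified after:", p.identified())
}
```

A real version would also need to decide when to reset this state, for example once every connection to the peer has closed; the sketch leaves that out.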

Stebalien · 2024-10-18T23:44:23Z

This relates to #2355 and attaching protocol information to connections instead of peers.

MarcoPolo · 2024-10-21T18:28:28Z

FYI, I have a branch I'm working on that should solve this issue and improve the best connection logic. The basic idea is to create a small new service that subscribes to Identify events and fulfills requests to return the best connection for a peer that supports a given protocol and meets other criteria (e.g. is it a limited connection? Prefer a connection with existing streams). I'll get it pushed soon after the next go-libp2p release.
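
The selection criteria mentioned above could be sketched roughly as follows. This is not the branch's code: `connInfo`, `bestConn`, and the protocol ID `/example/1.0.0` are made up for illustration, and the ordering (open, supports the protocol, not limited, more existing streams) is one plausible reading of the comment.

```go
package main

import (
	"fmt"
	"sort"
)

// connInfo is a hypothetical summary of one connection to a peer.
type connInfo struct {
	ID        string
	Closed    bool
	Limited   bool
	Streams   int
	Protocols map[string]bool // protocols learned via identify on this connection
}

// bestConn returns the preferred open connection that supports proto,
// or false if no open connection supports it.
func bestConn(conns []connInfo, proto string) (connInfo, bool) {
	var candidates []connInfo
	for _, c := range conns {
		if !c.Closed && c.Protocols[proto] {
			candidates = append(candidates, c)
		}
	}
	if len(candidates) == 0 {
		return connInfo{}, false
	}
	sort.SliceStable(candidates, func(i, j int) bool {
		a, b := candidates[i], candidates[j]
		if a.Limited != b.Limited {
			return !a.Limited // unlimited connections come first
		}
		return a.Streams > b.Streams // then prefer more existing streams
	})
	return candidates[0], true
}

func main() {
	supported := map[string]bool{"/example/1.0.0": true}
	conns := []connInfo{
		{ID: "old", Closed: true, Streams: 3, Protocols: supported},
		{ID: "relay", Limited: true, Streams: 5, Protocols: supported},
		{ID: "direct", Streams: 1, Protocols: supported},
	}
	if c, ok := bestConn(conns, "/example/1.0.0"); ok {
		fmt.Println("best connection:", c.ID)
	}
}
```

Because closed connections are filtered out before ranking, a stale connection can never be chosen as the one to wait on.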

MarcoPolo · 2024-11-07T21:36:40Z

These comments together:

The problem: when I disconnect and reconnect to a peer, the latter finds it impossible to open a stream, with an error indicating that identify failed to complete. This happens consistently in our node, and I can trigger it reliably, so there is some bug related to identify.

and

I have worked around it by manually making streams and skipping the identify wait

make me rule out that this is only an issue with a dropped connection, like what I mentioned, since in that case the workaround of manually making streams should not work either.

My current theory is that this is related to the eventbus being stalled, as @Wondertan points out. It would be good to revisit #2361. The main argument against it was "We already have metrics, we don't need metrics AND logs." Looking at this again now, I still think having logs in addition would be nice, since setting up Grafana and Prometheus is non-trivial (and a big ask of our users), and issues like these should be easier to debug.

I'll make a PR to add this logging. I'll tag vyzo to try it and see if that is indeed their issue.
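
As a sketch of what logging on slow consumers could look like (this is a standard-library illustration, not the code from the PR): the emitter tries a non-blocking send first, and only when the subscriber's queue is full does it log a warning before falling back to a blocking send. The function `emit`, its parameters, and the message format are all assumptions.

```go
package main

import (
	"fmt"
	"log"
)

// emit delivers ev to sub. If the subscriber's queue is full, it first calls
// warn with a message naming the subscriber, then blocks until the event is
// accepted or done is closed. It reports whether the event was delivered.
func emit(name string, sub chan<- string, ev string, done <-chan struct{}, warn func(string)) bool {
	// Fast path: the subscriber has room, so no warning is needed.
	select {
	case sub <- ev:
		return true
	default:
	}

	warn(fmt.Sprintf("slow consumer %q: queue full (%d/%d) while emitting %q",
		name, len(sub), cap(sub), ev))

	// Slow path: keep the blocking semantics, but allow shutdown to interrupt.
	select {
	case sub <- ev:
		return true
	case <-done:
		return false
	}
}

func main() {
	sub := make(chan string, 1) // a subscriber that never reads
	done := make(chan struct{})
	close(done) // closed up front so the example terminates instead of hanging

	warn := func(msg string) { log.Println(msg) }

	fmt.Println("first event delivered:", emit("example-subscriber", sub, "first", done, warn))
	fmt.Println("second event delivered:", emit("example-subscriber", sub, "second", done, warn))
}
```

With a log line like this, a stalled subscription shows up in ordinary logs without requiring a metrics stack.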

p-shahi added the kind/bug A bug in existing code (including security flaws) label Sep 27, 2024

MarcoPolo self-assigned this Nov 4, 2024

MarcoPolo mentioned this issue Nov 7, 2024

feat: eventbus: log error on slow consumers #3031

Merged
