Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues with identify: identify failed to complete #2983

Open
vyzo opened this issue Sep 27, 2024 · 11 comments
Open

Issues with identify: identify failed to complete #2983

vyzo opened this issue Sep 27, 2024 · 11 comments
Assignees
Labels
kind/bug A bug in existing code (including security flaws)

Comments

@vyzo
Copy link
Contributor

vyzo commented Sep 27, 2024

The problem: when I disconnect and reconnect to a peer, the latter finds it impossible to open a stream, with the error indication that identified failed to complete. This happens consistently in our node, and I can trigger it reliably; so there is some bug related to identify.

Relevant logs:

2024-09-27T12:48:24.680Z	DEBUG	basichost	basic/basic_host.go:483	negotiated: /ipfs/id/push/1.0.0 (took 27.52µs)
2024-09-27T12:48:24.680Z	DEBUG	net/identify	identify/id.go:543	/ipfs/id/push/1.0.0 received message from 12D3KooWEuVEZ26PwP7f8mYxuQ97ChiqLvXL5JeDfmXmqN4ds3YT /ip4/45.139.212.29/udp/17846/quic-v1
2024-09-27T12:48:30.475Z	DEBUG	basichost	basic/basic_host.go:483	negotiated: actor/root/messages/0.0.1 (took 48.66µs)
2024-09-27T12:48:30.479Z	DEBUG	actor	actor/dispatch.go:205	dispatching message from {BAAREICLTJJ63SCP5MUFVGX7GUM42QL2I5E46I355CASAEDIL55VX4SHTA====== did:key:z6MkjYP7Q8m862LdqNtExNMcH1BzgAkYTVp6QC5gHesktKbu {12D3KooWEuVEZ26PwP7f8mYxuQ97ChiqLvXL5JeDfmXmqN4ds3YT root}} to /public/hello
2024-09-27T12:48:30.479Z	DEBUG	node	node/public.go:59	hello from 12D3KooWEuVEZ26PwP7f8mYxuQ97ChiqLvXL5JeDfmXmqN4ds3YT
2024-09-27T12:48:30.479Z	DEBUG	swarm2	swarm/swarm.go:475	[12D3KooWFcvZkrs1LTQCdGEVPYi7diCL76Z9f1LcgcMMspujKWTj] opening stream to peer [12D3KooWEuVEZ26PwP7f8mYxuQ97ChiqLvXL5JeDfmXmqN4ds3YT]
2024-09-27T12:48:33.450Z	WARN	libp2p	libp2p/libp2p.go:380	send: failed to open stream to peer 12D3KooWEuVEZ26PwP7f8mYxuQ97ChiqLvXL5JeDfmXmqN4ds3YT: identify failed to complete: context deadline exceeded
Version Information
	github.com/libp2p/go-libp2p v0.36.4
@vyzo
Copy link
Contributor Author

vyzo commented Sep 27, 2024

I have worked around it by manually making streams and skipping the identify wait, but there certainly seem to be a bug in identify here.

@p-shahi p-shahi added the kind/bug A bug in existing code (including security flaws) label Sep 27, 2024
@sukunrt
Copy link
Member

sukunrt commented Oct 7, 2024

Is there any more info you can provide? A trivial connect -> disconnect -> connect test doesn't reproduce this for me.

when I disconnect and reconnect to a peer, the latter finds it impossible to open a stream

Do you mean when you open a stream to the peer after reconnecting, the error is identify failed to complete?

I see DEBUG log about NewStream failing with id failed to complete
Do you get any DEBUG logs from the identify package which are relevant to the peer?

@vyzo
Copy link
Contributor Author

vyzo commented Oct 7, 2024

I am trying to understand the problem better, hipefully i can get you a good log package to diagnose this.

But yes, we fail to open streams because identify fails to complete.

@Wondertan
Copy link
Contributor

We observed similar behavior when event bus subscriptions were not read fast enough on our side. A client connects and initiates identify; a server processes a new connection in the swarm and blocks never reaching start and thus not processing identify streams. JFYI

@MarcoPolo
Copy link
Collaborator

when I disconnect and reconnect to a peer, the latter finds it impossible to open a stream

I'm not sure I parse this sentence correctly. I'm understanding it as meaning:

  1. Peer A is connected to peer B.
  2. Peer A disconnects to peer B; then reconnects to peer B.
  3. Peer B attempts to open a stream to peer A.
  4. Peer B fails to open a stream.

Is that understanding correct? Or is it instead:

  1. Peer A is connected to peer B.
  2. Peer A disconnects to peer B; then reconnects to peer B.
  3. Peer A attempts to open a stream to peer B.
  4. Peer A fails to open a stream.

The issue kind of sounds like we aren't picking the best conn, and then we try to identify on it.

@vyzo
Copy link
Contributor Author

vyzo commented Oct 9, 2024

It is the first scenario.

@MarcoPolo
Copy link
Collaborator

MarcoPolo commented Oct 9, 2024

ah okay. That makes sense. The issue is probably that Peer B doesn't realize that the "best connection" it picked is actually closed/disconnected. So it times out on waiting for that connection to Identify (it never will).

We can be smarter here and interrupt with an even better connection if a new one appears.

@vyzo
Copy link
Contributor Author

vyzo commented Oct 9, 2024

Maybe we should have a collective identify completion channel for the peer, and not one for each conn.

@Stebalien
Copy link
Member

This relates to #2355 and attaching protocol information to connections instead of peers.

@MarcoPolo
Copy link
Collaborator

fyi, I have a branch I'm working on that should solve this issue and improve the best connection logic. The basic idea is to create a small new service that subscribes to Identify events and fulfills request to return the best connection for a peer that supports a given protocol and other criteria (e.g. is it a limited connection? Prefer a connection with existing streams). I'll get it pushed soon after the next go-libp2p release.

@MarcoPolo MarcoPolo self-assigned this Nov 4, 2024
@MarcoPolo
Copy link
Collaborator

MarcoPolo commented Nov 7, 2024

These comments together:

The problem: when I disconnect and reconnect to a peer, the latter finds it impossible to open a stream, with the error indication that identified failed to complete. This happens consistently in our node, and I can trigger it reliably; so there is some bug related to identify.

and

I have worked around it by manually making streams and skipping the identify wait

Makes me rule out that this is only an issue with a dropped connection, like what I mentioned. Since in that case the workaround of manually making streams should also not work.

My current theory is that this is related to the eventbus being stalled as @Wondertan points out. It would be good to revisit #2361. The main argument against it was "We already have metrics, we don't need metrics AND logs." Looking at this again now, I still think having logs in addition would be nice, since setuping grafana and prometheus is non-trivial (and a big ask of our users), and issues like these should be easier to debug.

I'll make a PR to add this logging. I'll tag vyzo to try it and see if that is indeed their issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug A bug in existing code (including security flaws)
Projects
None yet
Development

No branches or pull requests

6 participants