-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Detect failures related to TCP simultaneous open and retry #2925
Comments
Thanks for the detailed report @elecbug! Have you noticed this issue in real networks, or just localhost/low-latency home networks? My first suspicion is that this is related to TCP simultaneous open feature, since that causes confusion as to who is the server or client. Could you try running your repro with a small change for me to disprove this theory? Please add these lines before your order := (bytes.Compare([]byte(host.ID()), []byte(found)))
if order == -1 {
time.Sleep(100*time.Millisecond)
}
// existing code below
err = host.Connect(ctx, found) This basically has one side wait a bit if its peer id is lexicographically less. This should remove any chance for simultaneous open. If the issue still happens after that change, please report back and I'll dive deeper. |
The experimental environment was a dockers warm network built with about eight servers. First of all, it seems that the error no longer occurs in the way you suggested. One request, then can I know the code area where the near-simultaneous call of |
The "problem" is that TCP allows two endpoints to establish a connection if "two TCP simultaneously initiate the procedure" May I suggest using the QUIC stack instead? You won't have these issues there. |
Closing this as it's only an issue in local environments, and there isn't a great fix. |
Note: This was fixed in multistream (not the best fix, but the best way to do it without breaking things), but then the fix was removed. |
Context: #2330 |
I had the same problem in my project, basically this occurs because you are using mDNS, when after you create the peer if you wait a while before starting mDNS this does not happen. |
Really, the main issue is reuseport. An alternative is to use a different source port for outbound connections than the inbound port, but that makes it more difficult to detect port mappings in NATs, IIRC. |
That's correct. reusing the port for dialing out lets us learn about our port mapping and our observed address as seen by our peer. Maybe we should change the circumstances that trigger this. For example, in mDNS we can randomly delay publishing ourselves to avoid the simultaneous open in local environments. |
It's not just MDNS. Any scenario causing two nodes to try to connect at the same time will trigger it, especially if those nodes are far apart (e.g., synchronized reconnect timers, etc.). |
Sorry for not being clear. I did not say this is a problem in just mDNS. I outlined a way to avoid this problem when peers learn about each other at the same time via mDNS. I'm not sure how common this over the internet. I've not found any data to suggest this happens frequently. And we have some data that shows we don't see this (libp2p/specs#573). My intuition is that this is mainly a problem in networks with low latency such as the loopback interface and local networks, as peers can learn about each other quickly in those environments and send SYN packets to each other quickly. If we accept the assumption that this is only an issue in local networks, we could not reuse the listen port for those networks. However, even if we do that, it would still be better to avoid having two connections when one would suffice. Therefore, changes to protocols that cause nodes to simultaneously connect would still be preferred. |
You can detect simultaneous open in the security protocol, for example when a TLS client receives a TLS ClientHello. crypto/tls emits a really specific error. Then you can do a randomized backoff. Given how rare simultaneous open is in the wild, this doesn't cost you a lot. Or just use QUIC, and avoid the entire simultaneous open nonsense... |
Is it possible to implement quic-go with pnet? |
No, there's no way to do that, see #1432. |
It is more common in networks with high latency because SYN packets are more likely to cross on the wire. Of course, you need a simultaneous trigger. But that can happen if an application has non-randomized reconnect timers and/or some central server signals two nodes to connect to each-other at the same time.
Yeah, this is the right way to fix it (that or use QUIC). And now that we have the holepunching logic, it should "just work" (return a retry error, backoff some amount of time (possibly even based on the peer IDs) and short-circuit if we see a new incoming connection). |
@cpeliciari I'd love to hear why you use PNET here: #2942. |
I like this idea. Let's do it. We also have to handle the case for when the failure happens in the multistream select layer, but it should be straightforward. Reopening to track this fix. |
While working on my personal project, I noticed that when using the noise protocol, both peers are intermittently getting error messages when they try to connect to each other at the same time.
Link's project is a reproduction of that part. and I solved the problem by storing each other in the peer store and connecting streams without trying to connect directly at the moment.
The file below is a log.txt file included in the link, which shows that noise errors occur intermittently when a docker is used to create an isolated environment and try to connect by repeatedly creating/removing another peer with one peer.
If you want to reproduce the situation, you can reproduce it if you follow the sequence of test.sh .
thanks for reading and efforting.
Version Information
The text was updated successfully, but these errors were encountered: