-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a "transient" network connectivity state #2696
Conversation
d624653
to
11515f0
Compare
if !forceDirect { | ||
if rh.Network().Connectedness(pi.ID) == network.Connected { | ||
connectedness := rh.Network().Connectedness(pi.ID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we wouldn't try to perform routing if we're just going to wait on an ongoing connection. But that's not a new issue.
11515f0
to
57736f1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with these changes. We just need to
- Fix this for the close path.
- Keep existing Connect behavior. I'll raise a PR to make Connect consistent with NewStream and return network.ErrTransientConn when appropriate.
- Deprecate the items in network.Connectedness that we don't care about.
p2p/net/swarm/swarm.go
Outdated
// Notify goroutines waiting for a direct connection | ||
if !c.Stat().Transient { | ||
newState := network.Transient |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should calculate newState just after we add it to conns map with connectednessUnlocked. We already have the lock it's a small set to check anyway.
There's a problem here that if someone makes a relayed connection after we have a direct connection this will send a limited connectivity event.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I addressed it correctly, can you please double check @sukunrt ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need.
oldState := s.connectednessUnlocked(p)
s.conns.m[p] = append(s.conns.m[p], c)
newState := s.connectednessUnlocked(p)
This prevents us from sending a PeerConnectedness Limited event when we make a relayed connection after a direct connection.
03131e9
to
495cd64
Compare
@sukunrt is there any remaining work here, or awaiting your final review? |
I'll review this as well |
p2p/net/swarm/swarm.go
Outdated
// changed notifications consider closed connections. When remote closes a connection | ||
// the underlying transport connection is closed first, so tracking changes to the connectedness | ||
// state requires considering this recently closed connections impact on Connectedness. | ||
func (s *Swarm) connectednessUnlocked(p peer.ID, considerClosed bool) network.Connectedness { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment doesn't make sense to me. I don't really understand the point of the body either.
It seems like we have two cases:
- We have multiple conns to a peer and we close one. If any other conn exists and is open then we are connected.
- We only have one conn to a peer a close it. The peer is no longer connected.
Why are we potentially returning Connected
in the case that every conn in the map returns c.IsClosed()=true
with considerClosed=true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may also have limited connections to the peer. So once a connection is closed, we check:
- If there is at least 1 unlimited connection, return
Connected
- Else, if there is at least 1 limited connection, return
Limited
- Else, return
NotConnected
Why are we potentially returning
Connected
in the case that every conn in the map returnsc.IsClosed()=true
withconsiderClosed=true
It is true. However, considerClosed
is true
only when called from addConn
and removeConn
, which hold the lock. removeConn
makes sure to remove the connection from s.conns.m
, so the function will not return Connected
if we removed the last unlimited connection.
Though, I don't really understand why considerClosed
is needed at all. @sukunrt could you provide some context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this is confusing.
There are two different states this function is called from:
When closing the connection from our side by calling Close, this function is called when the underlying transport connection is not closed.
When closing the connection because the remote Closed the connection, this function is called when the underlying transport connection has already been closed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider that we are connected to the peer over a QUIC and a relayed connection.
Now the peer closes the QUIC connection. So we have moved from a Connected
state to Limited
connectivity state.
To make this work, when calculating oldState in removeConn we need to consider connectivity assuming QUIC connection was still connected.
Alternatively we can store the last sent connectedness event for a peer. That would make this calculation cleaner at the expense of more state for an edge case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively we can store the last sent connectedness event for a peer. That would make this calculation cleaner at the expense of more state for an edge case.
I think we should do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
We have a deadlock in our event sending mechanism when the consumer closes the connection in the subscription goroutine. Closing the connection causes the subscription goroutine to emit an event and assuming the subscription queue is full this will deadlock. Check the test in 57c688e There are two options here:
We also need 2 to ensure that we send Limited vs NotConnected events in the right order. In case we are connected to a peer via a relay and a direct connection and the direct connection and relayed connection both close concurrently, right now there is a race between which event will be sent last, 2 ensures that NotConnected is always sent last. I have implemented 2 in 0dcc73b I'll refactor a bit if we do decide to go ahead with 2. |
12d5826
to
c68a035
Compare
p2p/net/swarm/swarm.go
Outdated
// This ensures that if a event subscriber closes the connection from the subscription goroutine | ||
// this doesn't deadlock | ||
s.refs.Add(1) | ||
go func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would really like to avoid spawning a go routine on close
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we shouldn't do this. I haven't read the rest of the patch so I might be missing an obvious solution, but it looks like we're just trying to "raise" a flag for this peer.
I'd just avoid channels entirely and use the pattern I used in https://github.com/ipfs/boxo/blob/2559ec2c7b93ac6905fe83d7a4b38b8e9a703b0a/bitswap/network/connecteventmanager.go#L69-L81, https://github.com/ipfs/boxo/blob/2559ec2c7b93ac6905fe83d7a4b38b8e9a703b0a/bitswap/network/connecteventmanager.go#L97-L122. It's not simple, but it works and avoids stacking goroutines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven’t looked closely, but it seems like changeQueue is unbounded. Is that right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, but it shouldn't be an issue in practice. We only add a peer to the queue the first time the connectivity state changes. If we end up with a lot of "dead" peers in this queue, we should burn through them very quickly (transition from disconnected -> disconnected doesn't require any events).
We can do better by removing peers that have transitioned back to the same state from the queue, but it's likely not worth the complexity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a queue with this will work
We can do better by removing peers that have transitioned back to the same state from the queue, but it's likely not worth the complexity.
We do need the complexity because a malicious peer can connect, disconnect very quickly causing this queue to grow quickly if a consumer is slow. It can also do this with different peer ids.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a general rule we should have no unbounded queues in the core logic. In this case I can think of a way that a reasonable application can fall victim to a dos attack.
Example: An application opens a new stream and does an exchange on a newly connected peer. If this is done by the same go routine that reads events then the attacker controls the rate events are popped off, and thus exploit this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, I'd agree. But in cases like this we can argue that the queue will hit a steady-state max size and stop growing (assuming the read side isn't deadlocked forever). It's not great, but if someone can DoS us due to this queue, they can also DoS us by running us out of memory before GC can kick in.
But ideally we'd just have some form of backpressure somewhere. And I think that's what we'll need given #2696 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am trying an approach where we have backpressure on the add side and to have an unbounded queue on the remove side(close connection) because the remove side is effectively bounded by the total number of connections since on the add connection side we do have backpressure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does 5271dd8 1abc063 look?
We apply backpressure when adding connections to the swarm. This allows us to enqueue remove conn events in a slice without spawning a goroutine. Back pressure on adding connections limits the slice length. We also ensure that we send at least 1 disconnection event for every peer we connect to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really have time for a full review but:
- I'd be careful to avoid too many channels/goroutines. Go tends to push you in that direction, but it can quickly get harder to understand than simple locks & conditions.
- Make sure backpressure is applied all the way to accepting connections in the first place. It sounds like this might be a gap in libp2p and may be something we want to fix in a separate patch.
(I defer to @MarcoPolo on all of this, I'm just kibitzing from the sidelines)
cee9aa1
to
02a74d8
Compare
9a247a9
to
a16b08d
Compare
Previously, we'd consider "transiently" connected peers to be connected. This meant: 1. We wouldn't fire a second event when transitioning to "really connected". The only option for users was to listen on the old-style per-connection notifications. 2. "Connectedness" checks would be a little too eager to treat a peer as connected. For 99% of users, "transient" peers should be treated as disconnected. So while it's technically a breaking change to split-out "transient" connectivity into a separate state, I expect it's more likely to fix bugs than anything. Unfortunately, this change _did_ require several changes to go-libp2p itself because go-libp2p _does_ care about transient connections: 1. We want to keep peerstore information for transient peers. 2. We may sometimes want to treat peers as "connected" in the host. 3. Identify still needs to run over transient connections. fixes #2692
a16b08d
to
25ccbf2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one question, but I think this looks a lot cleaner than what was there before. Thanks!
e9b2300
to
b542b15
Compare
Previously, we'd consider "transiently" connected peers to be connected. This meant:
Notify
interface.For 99% of users, "transient" peers should be treated as disconnected. So while it's technically a breaking change to split-out "transient" connectivity into a separate state, I expect it's more likely to fix bugs than anything.
Unfortunately, this change did require several changes to go-libp2p itself because go-libp2p does care about transient connections:
fixes #2692