Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-connect #29

Open
tomberek opened this issue Nov 29, 2022 · 15 comments
Open

Re-connect #29

tomberek opened this issue Nov 29, 2022 · 15 comments

Comments

@tomberek
Copy link

I'm getting some re-connection difficulties.

  1. Two clients connect (one is a phone).
  2. Phone display is turned off.
  3. 2 minutes later the connection is broken and the other browser shows this error in console.
crypto-254fb80e.js:232 Error: Connection failed.
    at Peer._onConnectionStateChange (simple-peer-light.js:550:28)
    at Peer._pc.onconnectionstatechange (simple-peer-light.js:105:12)

I attempted various combinations of .leave() and .joinRoom without luck.

@jeremyckahn
Copy link
Contributor

This seems like it may be a simple-peer-light issue. I've noticed that it doesn't handle reconnections after system suspension (sleep) well. Specifically, when there is an ICE connection failure, it doesn't attempt to reconnect: https://github.com/mitschabaude/simple-peer-light/blob/503028ee121687d87bb7042b24f6934a0b06ca14/index.js#L745-L752

I wonder if the connection failure can be detected by Trystero and so that it can attempt to reconnect? If not, simple-peer-light may need to be patched directly.

@jeremyckahn
Copy link
Contributor

jeremyckahn commented Nov 30, 2022

Considering this:

I attempted various combinations of .leave() and .joinRoom without luck.

I don't think this is a failure mode that Trystero or application code can handle. It seems that Trystero is appropriately cleaning up Peer instances:

trystero/src/room.js

Lines 324 to 330 in 78a65e7

leave: () => {
entries(peerMap).forEach(([id, peer]) => {
peer.destroy()
delete peerMap[id]
})
onSelfLeave()
},

export default (onPeer, onSelfLeave) => {

trystero/src/torrent.js

Lines 253 to 262 in 78a65e7

return room(
f => (onPeerConnect = f),
async () => {
const infoHash = await infoHashP
trackerUrls.forEach(url => delete socketListeners[url][infoHash])
delete occupiedRooms[ns]
clearInterval(announceInterval)
cleanPool()
}

trystero/src/torrent.js

Lines 223 to 231 in 78a65e7

const cleanPool = () => {
entries(offerPool).forEach(([id, {peer}]) => {
if (!handledOffers[id] && !connectedPeers[id]) {
peer.destroy()
}
})
handledOffers = {}
}

So, it doesn't look like there's much that Trystero can do if leaving and rejoining the room doesn't fix the connection. According to the WebRTC spec, a "failed" state is terminal:

The RTCIceTransport has finished gathering, received an indication that there are no more remote candidates, finished checking all candidate pairs, and all pairs have either failed connectivity checks or lost consent, and either zero local candidates were gathered or the PAC timer has expired [RFC8863]. This is a terminal state until ICE is restarted.

Luckily, there appears to be a RTCPeerConnection.restartIce() method. It does not appear that simple-peer-light ever calls it: https://github.com/mitschabaude/simple-peer-light/search?q=restartIce&type=code

So, it seems that simple-peer-light will need to be patched to enable reconnections due to ICE failures. After that, it might be worth adding some connection retry logic to Trystero.

EDIT: There appears to be prior art that we may be able to use for implementing ICE restarts here: https://github.com/feross/simple-peer/pull/771/files

@tomberek
Copy link
Author

tomberek commented Dec 1, 2022

The entire process cannot be restarted upon identification of the failure? I mean from the side of the suspended client.

@jeremyckahn
Copy link
Contributor

jeremyckahn commented Dec 1, 2022

The entire process cannot be restarted upon identification of the failure? I mean from the side of the suspended client.

I'm not sure. It would probably take some experimentation. What I'm struggling with is how to trigger an ICE candidate failure so that error handling could be tested and debugged. Do you have a reliable way of doing that?

@jeremyckahn
Copy link
Contributor

I experienced this issue again today. I noticed in Chrome's chrome://webrtc-internals/ page that there were four connections that I couldn't get rid of via leave()ing and re-joinRoom()ing again. I could create ten new ones and cleanly remove them by leaving and rejoining, but the four "zombie" connections persisted until I refreshed the page. I suspect that "zombie" connections may be preventing reconnection (though I cannot prove that).

I need to spend more time to better understand Trystero's connection lifecycle, but I noticed this onDisconnect function:

const onDisconnect = id => delete connectedPeers[id]

It is called upon peer close events:

peer.on(events.close, () => onDisconnect(val.peer_id))

peer.on(events.close, () => onDisconnect(val.peer_id))

I wonder if this could be bound to the peer's events.error event as well? I'm thinking that connection error events aren't fully handled by Trystero, so they don't get properly cleaned up and therefore prevent reconnection.

@dmotz do you thoughts about handling error events by calling onDisconnect when there is an error?

@jeremyckahn
Copy link
Contributor

I'm able to reliably reproduce the error state with two computers. On one computer, I just disable WiFi, wait 30-60 seconds, and turn it back on. The two peers can no longer connect. Leaving and rejoining the room on either computer doesn't reestablish connection.

However, I noticed that if both computers leave and rejoin the room, connection can successfully be reestablished. Sometimes it takes more than one try to leave and rejoin the room.

This suggests that the re-connection issue can be addressed within Trystero. It's worth exploring what the leave action is doing to clean up connections and see if it can be reused during a live connection as peers cycle in and out of the room.

@jeremyckahn
Copy link
Contributor

I've got a PR to fix this at #30. @tomberek please test it out and see if it fixes the issue for you.

@tomberek
Copy link
Author

tomberek commented Dec 7, 2022

Sadly, no. The same sequence of events still did not lead to a re-connection. I suspect one of the peers is trying, but no the other.

@jeremyckahn
Copy link
Contributor

This is consistent with my experience on mobile, even with this fix. There appears to be some number of iOS-specific issues left to deal with, so we will have to iterate on a full solution for this issue.

Does #30 at least improve reconnection success for you on desktop Chrome and Firefox?

@tomberek
Copy link
Author

tomberek commented Dec 8, 2022

@jeremyckahn Yes, #30 improves leave/rejoin behavior between desktop browsers. Thanks!

@jeremyckahn
Copy link
Contributor

That's great to hear! Thanks for testing @tomberek.

If #30 seems good to @dmotz, I recommend merging and releasing the change when it's convenient but leaving this issue open. I predict that this problem will take time and iteration to truly solve, but it can be gradually improved over time with incremental fixes.

dmotz added a commit that referenced this issue Dec 9, 2022
fix: (#29) Replace offer when peer is disconnected
@dmotz
Copy link
Owner

dmotz commented Dec 9, 2022

I released the related PR as 0.11.8. I haven't had a chance to look into this much yet, but will try to explore soon. It may be worth doing some tests with the Firebase strategy to determine whether the problem lies on the matchmaking side or the WebRTC library side. If it's the latter I'm open to finding a new RTC connection management lib or building one in.

@jeremyckahn
Copy link
Contributor

Thank you for the 0.11.8 release, @dmotz!

It may be worth doing some tests with the Firebase strategy to determine whether the problem lies on the matchmaking side or the WebRTC library side.

That's a great idea. I've only used the WebTorrent strategy so far so I don't know how its reliability might compare to Firebase in practice. I've noticed that clients are generally able to connect to WebTorrent servers and exchange SDPs. I think the remaining issues (at least on desktop browsers) are specific longer-running sessions, which is going to be challenging to reproduce and debug.

I think the iOS issues are a result of limitations that Apple has built in paired with platform-specific bugs. For example:

There may be only so much we can do for iOS until Apple fixes their platform-level issues. 😕


In case it's helpful, https://www.farmhand.life/#online/global (source) is running my Trystero branch (which, at the time of this comment, is in sync with the 0.11.8 release). I leave sessions running for days at a time, which is how I can observe the issues related to longer-lived sessions. It may be a helpful way to manifest issues for testing and debugging purposes.

@aehlke
Copy link

aehlke commented May 17, 2023

@jeremyckahn any updates on your experience?

@jeremyckahn
Copy link
Contributor

@jeremyckahn any updates on your experience?

No, it's about the same since my last comment. I haven't been exploring connection improvements since then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants