General thoughts about iroh-net #2860

CGamesPlay · 2024-10-29T13:29:27Z

CGamesPlay
Oct 29, 2024

Hello Team Iroh!

A side project of mine is building a VPN-ish on top of QUIC (point-to-point tunneling over QUIC). One of the design goals is supporting ad hoc connections between various machines, and Iroh seemed like a dead fit for this. My PoC previously used quinn to do handshaking between peers with self-signed certificates, and I migrated to iroh-net, which was a +695 -1185 line diff, plus I got relay server and STUN support for free. That's pretty awesome! So, I wanted to give a quick experience report. I hit a handful of problems when using the library, so this may come across as critical, but it's given in good faith, and overall my experience was actually quite good.

Here's a bit more detail about my project: you can kind of think of it like "mosh but for port forwards". Users bootstrap a peer connection over SSH or some other channel, but afterwards port forwards happen over a QUIC tunnel. This means that peer discovery isn't relevant to me, but the relay server and STUN support are both useful. When connections between peers cannot be established, I can fall back to the original reliable channel to re-bootstrap it. This leads to my main question: how "long-lived" are connections? Given all the work going into multipath, it feels like I can expect an established connection to last, well, until it times out. But will Iroh switch back to relay servers mid-connection, for example?

I was surprised at the number and size of dependencies that iroh pulled in. Given my design, I disabled some of the features enabling DHT/discovery. Even so, my compilation times rose substantially as a result of migrating from quinn. I haven't checked, but I am also concerned that my final binary size will similarly balloon. I would be interested in knowing if it's possible to further reduce the number of dependencies that iroh pulls in (by disabling features that I don't need).

Endpoint::close is async and fallible. This means you can't close your endpoint in a Drop impl. It also returns anyhow::Result so it can fail... somehow? But it moves self, so the endpoint is gone even if it fails. What? I looked at the source code, and this method corresponds to quinn's close + quinn's wait_idle + an async fallible close of magic socket. This should be split into multiple different methods.
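One way to picture the split suggested above, as a minimal sketch. The type and method names here (`SketchEndpoint`, `close`, `closed_flag`) are hypothetical stand-ins and not iroh's or quinn's actual API. The only point is that a synchronous, infallible `close` can be called from `Drop`, while waiting for connections to drain could remain a separate, optional async step.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for an endpoint; not iroh's real type.
pub struct SketchEndpoint {
    closed: Arc<AtomicBool>,
}

impl SketchEndpoint {
    pub fn new() -> Self {
        Self { closed: Arc::new(AtomicBool::new(false)) }
    }

    /// Handle that lets other code observe whether the endpoint was closed.
    pub fn closed_flag(&self) -> Arc<AtomicBool> {
        Arc::clone(&self.closed)
    }

    /// Synchronous and infallible: only marks the endpoint as closed.
    /// Takes `&self`, so it is idempotent and callable from `Drop`.
    pub fn close(&self) {
        self.closed.store(true, Ordering::SeqCst);
    }
}

impl Drop for SketchEndpoint {
    fn drop(&mut self) {
        // Best-effort close; a graceful drain would be a separate async method.
        self.close();
    }
}
```

With this shape, dropping the endpoint is enough to trigger the non-graceful close, and nothing about closing can fail after `self` is gone.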

On that note, I'm disappointed that everything is an anyhow::Error. It basically means that my program cannot do anything meaningful with any error returned from Iroh. But also, as a developer, it means I have no idea what errors I should even expect from Iroh. Are the error possibilities relevant to me, or can I safely call unwrap? I can't tell with anyhow::Error. I suspect this one is just "better error handling is on the roadmap", but wanted to call it out as a sticking point for a new user of the project.
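To make the complaint concrete, here is a minimal sketch of what a typed error gives the caller. `ConnectError`, `connect_sketch`, and `should_retry` are invented for illustration and are not part of iroh; the variants are assumptions about failure cases one might want to tell apart.

```rust
use std::fmt;

/// Hypothetical error type; the variants are illustrative assumptions.
#[derive(Debug, PartialEq, Eq)]
pub enum ConnectError {
    /// The peer did not answer in time; retrying may help.
    Timeout,
    /// The caller passed an invalid address; retrying will not help.
    InvalidAddress(String),
}

impl fmt::Display for ConnectError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConnectError::Timeout => write!(f, "connection attempt timed out"),
            ConnectError::InvalidAddress(a) => write!(f, "invalid address: {a}"),
        }
    }
}

impl std::error::Error for ConnectError {}

/// Hypothetical connect function: rejects an empty address, otherwise
/// pretends the peer never answers.
pub fn connect_sketch(addr: &str) -> Result<(), ConnectError> {
    if addr.is_empty() {
        return Err(ConnectError::InvalidAddress(addr.to_string()));
    }
    Err(ConnectError::Timeout)
}

/// The caller can decide per variant instead of guessing.
pub fn should_retry(err: &ConnectError) -> bool {
    matches!(err, ConnectError::Timeout)
}
```

The signature alone tells a reader which failures to expect, which is exactly what an `anyhow::Error` return type cannot do.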

I also encountered a few things that are almost certainly bugs. First, when you use bind_addr_v6("[::1]:0".parse().unwrap()), iroh still returns a NodeAddr that includes my STUN addresses. I was surprised by this, since I used a loopback address for Iroh, so how it even reached a STUN server is beyond me. Wireshark confirmed that Iroh was not respecting my bind address and sending traffic on my public interface. Second, if I have a connection to myself (same process; different Endpoints), calling left_conn.close(...); right_conn.closed().await will never finish. I see CC frames in Wireshark, but the remote connection (right_conn) is never reported closed. Adding a sleep fixes this.

Overall, I'm quite pleased with the amount of work that Iroh is doing for me, and I'm excited to test it out in real-world conditions involving NAT. Kudos for the work done so far, and I hope that this report is helpful to the developers and other potential users.

flub · 2024-10-29T14:13:30Z

flub
Oct 29, 2024
Maintainer

Hi @CGamesPlay!

> A side project of mine is building a VPN-ish on top of QUIC (point-to-point tunneling over QUIC). One of the design goals is supporting ad hoc connections between various machines, and Iroh seemed like a dead fit for this. My PoC previously used quinn to do handshaking between peers with self-signed certificates, and I migrated to iroh-net, which was a +695 -1185 line diff, plus I got relay server and STUN support for free. That's pretty awesome!

Cool!

> This leads to my main question: how "long-lived" are connections? Given all the work going into multipath, it feels like I can expect an established connection to last, well, until it times out. But will Iroh switch back to relay servers mid-connection, for example?

There is no real limit on how long a connection can last. Theoretically there are some limits in QUIC, because you can only send 2**62-1 packets, plus a few other similar limits. In practice, however, it depends on how stable the network is and whether packets keep being delivered. Iroh configures the QUIC protocol to ping regularly to keep connections as alive as possible in the face of NAT mappings etc. But maybe there are some timeouts to configure to tweak this a bit more, TransportConfig::max_idle_timeout comes to mind, in the long term we should have pretty sane defaults though.
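To put the packet-count limit in perspective, here is a rough calculation. The rate of one million packets per second used in the note below is an arbitrary assumption chosen for illustration, and `years_to_exhaust` is a hypothetical helper, not part of any library.

```rust
/// Packet-count bound quoted above: 2^62 - 1.
pub const MAX_PACKETS: u64 = (1u64 << 62) - 1;

/// Whole 365-day years needed to send `MAX_PACKETS` packets at a
/// constant rate. `packets_per_sec` must be non-zero.
pub fn years_to_exhaust(packets_per_sec: u64) -> u64 {
    const SECS_PER_YEAR: u64 = 365 * 24 * 60 * 60;
    MAX_PACKETS / packets_per_sec / SECS_PER_YEAR
}
```

At one million packets per second this comes out to roughly 146,000 years, so network stability will end a connection long before the packet counter does.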

When a direct network path stops working, iroh should fall back to the relay path. I emphasise should because I've recently seen a bug report that suggests our default timeouts might not be connected correctly for this, and we haven't addressed that yet. But the functionality for it is there. And this is an area that will improve when we switch to QUIC Multipath.
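The intended fallback can be pictured with a toy model. Everything here is hypothetical: `PathChoice`, `choose_path`, and the timeout parameter are illustrative names rather than iroh internals, and the real selection logic is certainly more involved.

```rust
use std::time::Duration;

/// Which path the next packet would take in this toy model.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PathChoice {
    Direct,
    Relay,
}

/// Prefer the direct path while it has been heard from recently;
/// otherwise fall back to the relay.
/// `since_last_direct_rx` is `None` if nothing ever arrived directly.
pub fn choose_path(
    since_last_direct_rx: Option<Duration>,
    direct_timeout: Duration,
) -> PathChoice {
    match since_last_direct_rx {
        Some(age) if age < direct_timeout => PathChoice::Direct,
        _ => PathChoice::Relay,
    }
}
```

In this model the connection itself never closes when the direct path goes quiet; only the route changes, which is the behaviour the question was asking about.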

> I was surprised at the number and size of dependencies that iroh pulled in. Given my design, I disabled some of the features enabling DHT/discovery. Even so, my compilation times rose substantially as a result of migrating from quinn. I haven't checked, but I am also concerned that my final binary size will similarly balloon. I would be interested in knowing if it's possible to further reduce the number of dependencies that iroh pulls in (by disabling features that I don't need).

Yes, this is kind of a known issue. We haven't done much work to reduce this yet, though we occasionally spend some effort de-duplicating things. We have plans to make all discovery mechanisms optional using cargo features. But at some point we'll have to figure out how to make this smaller.

> Endpoint::close is async and fallible. This means you can't close your endpoint in a Drop impl. It also returns anyhow::Result so it can fail... somehow? But it moves self, so the endpoint is gone even if it fails. What? I looked at the source code, and this method corresponds to quinn's close + quinn's wait_idle + an async fallible close of magic socket. This should be split into multiple different methods.

Closing the endpoint is definitely a rough edge right now. I think it will close on drop, but that is certainly not a graceful close. Making it async is an attempt to nudge folks not to rely on drop, because that is not ideal. Including wait_idle in it is likewise an attempt at improving usability, but perhaps not that successful yet. We've discussed this recently as well and certainly want to improve things, so feel free to open an issue and explain what you'd like to be able to do wrt closing. I think that would be helpful.

> On that note, I'm disappointed that everything is an anyhow::Error.

Yeah, this is also a known issue. The plan is to improve on this before 1.0 as well, and there is already #2741 for this. To move this forward, it helps to create issues for specific APIs where the anyhow error covers several cases that you need to distinguish between. That is the actionable feedback we need to improve the errors.

In most of our own usage we've found little need to distinguish between them, because the decision on how to handle an error usually depends only on which API returned it. That is why errors have stayed on anyhow for now.

> as a developer, it means I have no idea what errors I should even expect from Iroh. Are the error possibilities relevant to me, or can I safely call unwrap?

If it's an error then you have to assume the thing failed. We don't return errors that are "safe to unwrap" currently.

> I also encountered a few things that are almost certainly bugs. First, when you use bind_addr_v6("[::1]:0".parse().unwrap()), iroh still returns a NodeAddr that includes my STUN addresses. I was surprised by this, since I used a loopback address for Iroh, so how it even reached a STUN server is beyond me. Wireshark confirmed that Iroh was not respecting my bind address and sending traffic on my public interface.

Sounds interesting. Would you mind creating an issue and sharing a DEBUG-level log of when this happens? If you can write a reproducer for this that'd be great as well.

> Second, if I have a connection to myself (same process; different Endpoints), calling left_conn.close(...); right_conn.closed().await will never finish. I see CC frames in Wireshark, but the remote connection (right_conn) is never reported closed. Adding a sleep fixes this.

This is also surprising to me; I kind of assume we do this in our tests regularly. If you could write a minimal reproducer for this and file it as an issue as well that would be great.

> Overall, I'm quite pleased with the amount of work that Iroh is doing for me, and I'm excited to test it out in real-world conditions involving NAT. Kudos for the work done so far, and I hope that this report is helpful to the developers and other potential users.

Thanks for the thoughtful input! Feedback like this certainly allows us to find out what users experience!

CGamesPlay · 2024-10-30T05:28:23Z

CGamesPlay
Oct 30, 2024
Author

> But maybe there are some timeouts to configure to tweak this a bit more, TransportConfig::max_idle_timeout comes to mind, in the long term we should have pretty sane defaults though.

Oh interesting. I see that it's hard-coded to send keep alives every 1 second and time out after 30 seconds, and the config passed to the builder only modifies the behavior of incoming connections. So my connections won't last longer than 30 seconds of wifi downtime, no matter what I do.
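A toy model of that consequence, using the 30-second idle timeout mentioned above. The function is a hypothetical illustration, not quinn's or iroh's code. It makes a simplifying assumption: the idle timer restarts only when a packet arrives from the peer, so keep-alives sent into a dead network do not extend it.

```rust
use std::time::Duration;

/// Idle timeout as described in the comment above.
pub const MAX_IDLE_TIMEOUT: Duration = Duration::from_secs(30);

/// Given the times (measured from connection start, ascending) at which
/// packets arrived from the peer, return the moment the idle timer
/// fires, if it fires at or before `now`.
pub fn idle_timeout_at(
    arrivals: &[Duration],
    idle_timeout: Duration,
    now: Duration,
) -> Option<Duration> {
    // The timer starts running when the connection is established.
    let mut deadline = idle_timeout;
    for &t in arrivals {
        if t >= deadline {
            // The packet came too late; the connection is already gone.
            return Some(deadline);
        }
        // Each arrival pushes the deadline out by a full timeout.
        deadline = t + idle_timeout;
    }
    (now >= deadline).then_some(deadline)
}
```

With packets arriving at 1 s, 2 s and 3 s followed by silence, the model reports a timeout at 33 s; a 27-second gap before the next packet is survived.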

> We've discussed this recently as well and certainly want to improve things, so feel free to open an issue and explain what you'd like to be able to do wrt closing. I think that would be helpful.

#2867

> If it's an error then you have to assume the thing failed. We don't return errors that are "safe to unwrap" currently.

I think you're misunderstanding what I'm getting at. Any "boneheaded" error is "safe to unwrap", because the failure indicates a bug elsewhere in my code. Here are some examples in Iroh where you can safely call unwrap on an error, because there is no point in handling the failure at the call site.

- iroh::metrics::try_init_metrics_collection. It can only fail if it was already initialized, and if my code knows that it has not been (e.g. because I call it at the top of main), then I can safely unwrap the result (or discard it, but unwrap guards against a logic error on my part).
- Similarly, iroh::metrics::get_metrics can only fail if I didn't call try_init_metrics_collection successfully. If I know that I did, then I can safely unwrap this result.
- iroh::net::Endpoint::close. I document this in "iroh::net::Endpoint::close should not be fallible" (#2867).
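The first two cases follow a familiar init-once pattern, which can be sketched with the standard library alone. The functions below are hypothetical stand-ins that mimic the behaviour described for try_init_metrics_collection and get_metrics; they are not iroh's implementation.

```rust
use std::sync::OnceLock;

/// Hypothetical metrics state.
#[derive(Debug, PartialEq, Eq)]
pub struct Metrics {
    pub name: &'static str,
}

static METRICS: OnceLock<Metrics> = OnceLock::new();

/// Fails only if initialization already happened.
pub fn try_init_metrics_sketch() -> Result<(), &'static str> {
    METRICS
        .set(Metrics { name: "sketch" })
        .map_err(|_| "metrics already initialized")
}

/// Fails only if `try_init_metrics_sketch` never succeeded.
pub fn get_metrics_sketch() -> Result<&'static Metrics, &'static str> {
    METRICS.get().ok_or("metrics not initialized")
}
```

If the init call sits at the top of main, both results can be unwrapped: an error there would mean a logic bug in the calling program, not a condition worth handling at the call site.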

> If you can write a reproducer for this that'd be great as well.

#2866

> If you could write a minimal reproducer for this and file it as an issue as well that would be great.

~~Working on this one.~~ This one turned out to be my fault 😣

3 replies

flub Oct 30, 2024
Maintainer

> > But maybe there are some timeouts to configure to tweak this a bit more, TransportConfig::max_idle_timeout comes to mind, in the long term we should have pretty sane defaults though.
>
> Oh interesting. I see that it's hard-coded to send keep alives every 1 second and time out after 30 seconds, and the config passed to the builder only modifies the behavior of incoming connections. So my connections won't last longer than 30 seconds of wifi downtime, no matter what I do.

Yeah, I've been sort of aware of this but never wrote it down so clearly. Would you mind splitting this off into an issue as well? It clearly is wrong to force this exact behaviour on users via the connect side.

Thanks for creating the other issue and weighing in on the error issue with a much more reasonable approach. We'll follow up on those issues.

CGamesPlay Oct 31, 2024
Author

Created #2872.

Cheers. If any of these seem non-controversial (my thoughts about close being async appear to be controversial) and would make a good first PR, please let me know.

flub Nov 5, 2024
Maintainer

#2872 is probably not too controversial if you'd like to give it a go. A Builder::disable_ipv6 method for #2866 might also be worth a go, but it will involve fixing up several internals to make it work, so I'm guessing it's not as easy.

CGamesPlay · 2024-11-13T12:10:08Z

CGamesPlay
Nov 13, 2024
Author

What's the relationship between Endpoint::conn_type_stream and Connection::remote_address? It seems like remote_address should be deprecated and replaced with a method that returns a ConnectionType based on the current state of the connection.
