quic tpu: create fewer timeout futures #3151

Open

alessandrod wants to merge 2 commits into master

Conversation

alessandrod

Use a cancellation token instead of polling an Arc<bool> to exit connection tasks. Before this change we used to create a timeout future for each tx, and immediately cancel it since virtually all connections always have incoming txs pending. Cancelling timeout futures is expensive as it goes inside the tokio driver and takes a mutex. On high load (1M tps), cancelling timeout futures takes 8% (!) of run time.

With this change we create a cancellation token once when a connection is established and that's it - no overhead after that.
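For illustration, here is a minimal sketch of the per-connection shape described above. It is not the PR's actual code: it assumes quinn's `Connection::accept_uni` and tokio-util's `CancellationToken`, and it elides the per-stream handling. The token is created once per connection, and the accept loop selects on it instead of polling a flag or arming a timeout per transaction.

```rust
use quinn::Connection;
use tokio::select;
use tokio_util::sync::CancellationToken;

// Hypothetical connection task: one CancellationToken per connection, created
// when the connection is accepted. The hot loop never creates (or cancels)
// timeout futures for incoming transactions.
async fn handle_connection(connection: Connection, cancel: CancellationToken) {
    loop {
        select! {
            // Fired when the server decides to drop this peer (e.g. connection
            // table eviction); we just stop the task.
            _ = cancel.cancelled() => break,
            // Fired when the peer opens a new unidirectional stream.
            stream = connection.accept_uni() => match stream {
                Ok(stream) => {
                    // Spawn the per-stream task here (elided); chunk reads are
                    // the only place that still needs a timeout.
                    drop(stream);
                }
                // Peer disconnected or the connection errored out.
                Err(_) => break,
            },
        }
    }
}
```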

@alessandrod (Author) left a comment:

I've highlighted the main changes in the PR. The rest are whitespace changes from reduced nesting.

// Wait for new streams. If the peer is disconnected we get a cancellation signal and stop
// the connection task.
let mut stream = select! {
    stream = connection.accept_uni() => match stream {

@alessandrod (Author):

Here we don't need the timeout. Even if a staked peer opens a connection and never sends anything, they can open at most 8 of them, and those will get flushed out by the LRU logic.

// Wait up to `wait_for_chunk_timeout`. If we don't get a chunk before then, we
// assume the stream is dead and stop the stream task. This can only happen if
// there's severe packet loss or the peer stops sending for whatever reason.
chunk = tokio::time::timeout(

@alessandrod (Author):

Here we still need a timeout, because we want to limit the number of open-but-unfinished streams we let a peer open. I have a tokio patch coming to make this faster.
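A rough sketch of the stream-side shape being discussed, for illustration only. It assumes quinn's `RecvStream::read_chunk` and an illustrative chunk size; the real code differs. The point is that the timeout bounds how long a peer can hold a stream open without sending data.

```rust
use std::time::Duration;

use quinn::RecvStream;

// Hypothetical per-stream task: a timeout on each chunk read limits how many
// open-but-unfinished streams a peer can hold against the server.
async fn handle_stream(mut stream: RecvStream, wait_for_chunk_timeout: Duration) {
    loop {
        match tokio::time::timeout(wait_for_chunk_timeout, stream.read_chunk(1232, true)).await {
            // No chunk before the deadline: assume the stream is dead.
            Err(_elapsed) => break,
            // Stream finished cleanly or errored.
            Ok(Ok(None)) | Ok(Err(_)) => break,
            Ok(Ok(Some(chunk))) => {
                // Accumulate the chunk into the transaction buffer (elided).
                drop(chunk);
            }
        }
    }
}
```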

KirillLykov previously approved these changes Oct 14, 2024

@KirillLykov left a comment:

Looks good to me: it replaces Arc<AtomicBool> with CancellationToken and all operations in select! are cancel-safe.
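As a side note, a minimal self-contained example (illustrative, not the PR's code) of why `CancellationToken` composes well with `select!`: `cancelled()` is cancel-safe, so dropping its future when another branch wins loses nothing, and the next call still observes the cancellation.

```rust
use std::time::Duration;

use tokio_util::sync::CancellationToken;

#[tokio::main]
async fn main() {
    let token = CancellationToken::new();
    let child = token.child_token();

    let worker = tokio::spawn(async move {
        loop {
            tokio::select! {
                // Cancel-safe: if the other branch fires first, this future is
                // dropped and the next `cancelled()` still sees the signal.
                _ = child.cancelled() => break,
                _ = tokio::time::sleep(Duration::from_millis(10)) => {
                    // Do one unit of work, then loop back to waiting.
                }
            }
        }
    });

    // Signal shutdown once; every child token observes it.
    token.cancel();
    worker.await.unwrap();
}
```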

@@ -6837,6 +6837,7 @@ dependencies = [
"solana-transaction-metrics-tracker",
"thiserror",
"tokio",
"tokio-util 0.7.1",

Any reason why the versions are different? Can we use 0.7.12 for both?

@alessandrod (Author), Oct 15, 2024:

Because this is how dependency resolution works. Something in the dep graph requires <= 0.7.1, and because 0.7.12 and 0.7.1 are semver compatible it's using that. I didn't go and edit Cargo.lock - cargo made this change. If you look at Cargo.lock, it's already using 0.7.1 in many places.

Strange, both are among solana-streamer's dependencies.

};

const WAIT_FOR_STREAM_TIMEOUT: Duration = Duration::from_millis(100);
-pub const DEFAULT_WAIT_FOR_CHUNK_TIMEOUT: Duration = Duration::from_secs(10);
+pub const DEFAULT_WAIT_FOR_CHUNK_TIMEOUT: Duration = Duration::from_secs(1);

What is the reasoning for changing this timeout value from 10s to 1s?

@alessandrod (Author):

The old timeout makes no sense - a validator has 1.6s worth of leader slots, so why are we waiting 10 seconds?

That was conservative -- it does forward if passing the leader slot. Note 1s < 1.6s. I would like to keep the old timeout to reduce the net change impact. If we are changing it, maybe in a separate PR?

@alessandrod (Author):

> Note 1s < 1.6s.

Yes, this is intentional. Waiting ten seconds for a single chunk of one transaction makes no sense. The only way it can happen is if the peer is being intentionally malicious and has opened one (or more) streams and is not sending data in them.

> it does forward if passing the leader slot

I'll check this, but hopefully we don't do it for that long, because that would also make no sense.

@alessandrod (Author):

> I would like to keep the old timeout to reduce the net change impact. If we are changing it, maybe in a separate PR?

To do this I'd have to make it do the old thing of sleeping in WAIT_FOR_CHUNK_TIMEOUT / 10 intervals, or the tests take forever. So I'd have to add back the old code complexity and create 10x the number of timeout futures - in a PR trying to reduce the number of timeout futures. To me "this is what we used to do" is not a good enough argument to justify the complexity and worse perf.
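To make the trade-off concrete, here is a hedged sketch of the two shapes, illustrative only and using an mpsc channel in place of the real QUIC stream: the old approach wakes up every `WAIT_FOR_CHUNK_TIMEOUT / 10` to poll an exit flag, creating up to ten timeout futures per chunk, while the new approach arms a single timeout.

```rust
use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};
use std::time::Duration;

use tokio::sync::mpsc;
use tokio::time::{timeout, Instant};

const WAIT_FOR_CHUNK_TIMEOUT: Duration = Duration::from_secs(1);

// Old shape (illustrative): poll the exit flag every WAIT_FOR_CHUNK_TIMEOUT / 10,
// so waiting for one chunk can create up to ten timeout futures.
async fn recv_polling(rx: &mut mpsc::Receiver<Vec<u8>>, exit: Arc<AtomicBool>) -> Option<Vec<u8>> {
    let deadline = Instant::now() + WAIT_FOR_CHUNK_TIMEOUT;
    while !exit.load(Ordering::Relaxed) && Instant::now() < deadline {
        if let Ok(chunk) = timeout(WAIT_FOR_CHUNK_TIMEOUT / 10, rx.recv()).await {
            return chunk;
        }
    }
    None
}

// New shape (illustrative): a single timeout per chunk and no flag polling;
// connection shutdown is signalled separately via the cancellation token.
async fn recv_once(rx: &mut mpsc::Receiver<Vec<u8>>) -> Option<Vec<u8>> {
    timeout(WAIT_FOR_CHUNK_TIMEOUT, rx.recv()).await.ok().flatten()
}
```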

My concern is that 1s is too small and a lot more txns may time out because of this. At the very least it needs to be within the leader slot. We do need to be careful that fixing one issue doesn't introduce other potential issues.

@alessandrod (Author):

Could you walk me through what would time out and for what reason?

[Screenshot 2024-10-15: mainnet timeout metrics] We see this on mainnet with the current 10s timeout value. The stream can time out due to network conditions (client, in between, server). I also wonder why the cancel mechanism you implemented here cannot be used.

@alessandrod (Author):

> We see this on mainnet with the current 10s timeout value. The stream can time out due to network conditions (client, in between, server).

We want to time out garbage connections as soon as possible; we're trying to build a chain that goes fast. If you're sending txs from your dialup at home, you should get disconnected.

Also, this is a timeout for one single stream; a connection can still send other streams. So in case of legitimate packet loss (one or a few transactions), we wait up to 1s, otherwise we move on and keep reading other transactions from the same peer, who will receive a reset stream error and presumably retry the failed transactions.

What you're seeing in those metrics is probably peers timing out because the whole connection times out, and then because we're currently extremely bad and read many streams in parallel, we increment the metric many times.

So tldr: on legit packet loss, 1s seems more than enough (assuming decent RTT <=200ms). When a peer falls off the planet and the connection just stops sending any data, timing out as soon as possible is the best thing we can do to allocate resources to peers who are actually sending transactions.

> I also wonder why the cancel mechanism you implemented here cannot be used.

I will let you role-play as a hacker for one day to answer that question!

I am not comfortable with such a big change to the timeout value. Can we set it to 2s for now as an interim approach and observe its impact in production?

alessandrod added 2 commits:

Use a cancellation token instead of polling an Arc<bool> to exit
connection tasks. Before this change we used to create a timeout future
for each tx, and immediately cancel it since virtually all connections
always have incoming tx pending. Cancelling timeout futures is expensive
as it goes inside the tokio driver and takes a mutex. On high load (1M
tps), cancelling timeout futures takes 8% (!) of run time.

With this change we create a cancellation token _once_ when a connection
is established and that's it - no overhead after that.

Don't wait 10s for one piece of a transaction to come before timing out.
Lower the timeout to 2s, after which we assume a stream is dead. 2s is
enough round trips to account for non catastrophic packet loss. If
packet loss is causing >2s stream latency, the connection is hosed and
the best thing we can do is save server resources for peers with better
connectivity.

@KirillLykov:

In CI test_quic_server_multiple_connections_on_single_client_endpoint failed with message:

thread 'nonblocking::quic::test::test_quic_server_multiple_connections_on_single_client_endpoint' panicked at streamer/src/nonblocking/quic.rs:1880:9:
assertion `left == right` failed
  left: 0
 right: 2

We check that the number of closed connections is 2 while it is 0 here

@alessandrod (Author):

> In CI test_quic_server_multiple_connections_on_single_client_endpoint failed with message:
>
> thread 'nonblocking::quic::test::test_quic_server_multiple_connections_on_single_client_endpoint' panicked at streamer/src/nonblocking/quic.rs:1880:9:
> assertion `left == right` failed
>   left: 0
>  right: 2
>
> We check that the number of closed connections is 2 while it is 0 here

From the code I removed:

// The min is to guard against a value too small which can wake up unnecessarily
// frequently and wasting CPU cycles. The max guard against waiting for too long
// which delay exit and cause some test failures when the timeout value is large.
// Within this value, the heuristic is to wake up 10 times to check for exit
// for the set timeout if there are no data.

This test is racy and was relying on sleep time being a lot larger than timeout time... I'll fix it properly tomorrow.
