quic tpu: create fewer timeout futures #3151

Open

alessandrod wants to merge 2 commits into master

Conversation

alessandrod

Use a cancellation token instead of polling an Arc<bool> to exit connection tasks. Before this change we used to create a timeout future for each tx, and immediately cancel it since virtually all connections always have incoming txs pending. Cancelling timeout futures is expensive as it goes inside the tokio driver and takes a mutex. On high load (1M tps), cancelling timeout futures takes 8% (!) of run time.

With this change we create a cancellation token once when a connection is established and that's it - no overhead after that.
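For illustration, here is a minimal sketch of the per-connection shape described above. It is not the PR's actual code: it assumes quinn's `Connection::accept_uni` and tokio-util's `CancellationToken`, and it elides the per-stream handling. The token is created once per connection, and the accept loop selects on it instead of polling a flag or arming a timeout per transaction.

```rust
use quinn::Connection;
use tokio::select;
use tokio_util::sync::CancellationToken;

// Hypothetical connection task: one CancellationToken per connection, created
// when the connection is accepted. The hot loop never creates (or cancels)
// timeout futures for incoming transactions.
async fn handle_connection(connection: Connection, cancel: CancellationToken) {
    loop {
        select! {
            // Fired when the server decides to drop this peer (e.g. connection
            // table eviction); we just stop the task.
            _ = cancel.cancelled() => break,
            // Fired when the peer opens a new unidirectional stream.
            stream = connection.accept_uni() => match stream {
                Ok(stream) => {
                    // Spawn the per-stream task here (elided); chunk reads are
                    // the only place that still needs a timeout.
                    drop(stream);
                }
                // Peer disconnected or the connection errored out.
                Err(_) => break,
            },
        }
    }
}
```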

@alessandrod (Author) left a comment:

I've highlighted the main changes in the PR. The rest are whitespace changes from reduced nesting.

// Wait for new streams. If the peer is disconnected we get a cancellation signal and stop
// the connection task.
let mut stream = select! {
    stream = connection.accept_uni() => match stream {

@alessandrod (Author):

Here we don't need the timeout. Even if a staked peer opens a connection and never sends anything, they can open at most 8 of them, and those will get flushed out by the LRU logic.

// Wait up to `wait_for_chunk_timeout`. If we don't get a chunk before then, we
// assume the stream is dead and stop the stream task. This can only happen if
// there's severe packet loss or the peer stops sending for whatever reason.
chunk = tokio::time::timeout(

@alessandrod (Author):

Here we still need a timeout, because we want to limit the number of open-but-unfinished streams we let a peer open. I have a tokio patch coming to make this faster.
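A rough sketch of the stream-side shape being discussed, for illustration only. It assumes quinn's `RecvStream::read_chunk` and an illustrative chunk size; the real code differs. The point is that the timeout bounds how long a peer can hold a stream open without sending data.

```rust
use std::time::Duration;

use quinn::RecvStream;

// Hypothetical per-stream task: a timeout on each chunk read limits how many
// open-but-unfinished streams a peer can hold against the server.
async fn handle_stream(mut stream: RecvStream, wait_for_chunk_timeout: Duration) {
    loop {
        match tokio::time::timeout(wait_for_chunk_timeout, stream.read_chunk(1232, true)).await {
            // No chunk before the deadline: assume the stream is dead.
            Err(_elapsed) => break,
            // Stream finished cleanly or errored.
            Ok(Ok(None)) | Ok(Err(_)) => break,
            Ok(Ok(Some(chunk))) => {
                // Accumulate the chunk into the transaction buffer (elided).
                drop(chunk);
            }
        }
    }
}
```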

KirillLykov previously approved these changes Oct 14, 2024

@KirillLykov left a comment:

Looks good to me: it replaces Arc<AtomicBool> with CancellationToken and all operations in select! are cancel-safe.
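As a side note, a minimal self-contained example (illustrative, not the PR's code) of why `CancellationToken` composes well with `select!`: `cancelled()` is cancel-safe, so dropping its future when another branch wins loses nothing, and the next call still observes the cancellation.

```rust
use std::time::Duration;

use tokio_util::sync::CancellationToken;

#[tokio::main]
async fn main() {
    let token = CancellationToken::new();
    let child = token.child_token();

    let worker = tokio::spawn(async move {
        loop {
            tokio::select! {
                // Cancel-safe: if the other branch fires first, this future is
                // dropped and the next `cancelled()` still sees the signal.
                _ = child.cancelled() => break,
                _ = tokio::time::sleep(Duration::from_millis(10)) => {
                    // Do one unit of work, then loop back to waiting.
                }
            }
        }
    });

    // Signal shutdown once; every child token observes it.
    token.cancel();
    worker.await.unwrap();
}
```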

@@ -6837,6 +6837,7 @@ dependencies = [
"solana-transaction-metrics-tracker",
"thiserror",
"tokio",
"tokio-util 0.7.1",

Any reason why the versions are different? Can we use 0.7.12 for both?

@alessandrod (Author), Oct 15, 2024:

Because this is how dependency resolution works. Something in the dep graph requires <= 0.7.1, and because 0.7.12 and 0.7.1 are semver compatible it's using that. I didn't go and edit Cargo.lock - cargo made this change. If you look at Cargo.lock, it's already using 0.7.1 in many places.

Strange, both are among solana-streamer's dependencies.

};

const WAIT_FOR_STREAM_TIMEOUT: Duration = Duration::from_millis(100);
-pub const DEFAULT_WAIT_FOR_CHUNK_TIMEOUT: Duration = Duration::from_secs(10);
+pub const DEFAULT_WAIT_FOR_CHUNK_TIMEOUT: Duration = Duration::from_secs(1);

What is the reasoning for changing this timeout value from 10s to 1s?

@alessandrod (Author):

The old timeout makes no sense - a validator has 1.6s worth of leader slots, so why are we waiting 10 seconds?

That was conservative -- it does forward if passing the leader slot. Note 1s < 1.6s. I would like to keep the old timeout to reduce the net change impact. If we are changing it, maybe in a separate PR?

@alessandrod (Author):

> Note 1s < 1.6s.

Yes, this is intentional. Waiting ten seconds for a single chunk of one transaction makes no sense. The only way it can happen is if the peer is being intentionally malicious and has opened one (or more) streams and is not sending data in them.

> it does forward if passing the leader slot

I'll check this, but hopefully we don't do it for that long, because that would also make no sense.

@alessandrod (Author):

> I would like to keep the old timeout to reduce the net change impact. If we are changing it, maybe in a separate PR?

To do this I'd have to make it do the old thing of sleeping in WAIT_FOR_CHUNK_TIMEOUT / 10 intervals, or the tests take forever. So I'd have to add back the old code complexity and create 10x the number of timeout futures - in a PR trying to reduce the number of timeout futures. To me "this is what we used to do" is not a good enough argument to justify the complexity and worse perf.
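To make the trade-off concrete, here is a hedged sketch of the two shapes, illustrative only and using an mpsc channel in place of the real QUIC stream: the old approach wakes up every `WAIT_FOR_CHUNK_TIMEOUT / 10` to poll an exit flag, creating up to ten timeout futures per chunk, while the new approach arms a single timeout.

```rust
use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};
use std::time::Duration;

use tokio::sync::mpsc;
use tokio::time::{timeout, Instant};

const WAIT_FOR_CHUNK_TIMEOUT: Duration = Duration::from_secs(1);

// Old shape (illustrative): poll the exit flag every WAIT_FOR_CHUNK_TIMEOUT / 10,
// so waiting for one chunk can create up to ten timeout futures.
async fn recv_polling(rx: &mut mpsc::Receiver<Vec<u8>>, exit: Arc<AtomicBool>) -> Option<Vec<u8>> {
    let deadline = Instant::now() + WAIT_FOR_CHUNK_TIMEOUT;
    while !exit.load(Ordering::Relaxed) && Instant::now() < deadline {
        if let Ok(chunk) = timeout(WAIT_FOR_CHUNK_TIMEOUT / 10, rx.recv()).await {
            return chunk;
        }
    }
    None
}

// New shape (illustrative): a single timeout per chunk and no flag polling;
// connection shutdown is signalled separately via the cancellation token.
async fn recv_once(rx: &mut mpsc::Receiver<Vec<u8>>) -> Option<Vec<u8>> {
    timeout(WAIT_FOR_CHUNK_TIMEOUT, rx.recv()).await.ok().flatten()
}
```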

My concern is that 1s is too small and a lot more txns may time out because of this. At the very least it needs to be within the leader slot. We do need to be careful that fixing one issue doesn't introduce other potential issues.

@alessandrod (Author):

Could you walk me through what would time out and for what reason?

[Screenshot 2024-10-15: mainnet timeout metrics] We see this on mainnet with the current 10s timeout value. The stream can time out due to network conditions (client, in between, server). I also wonder why the cancel mechanism you implemented here cannot be used.

@alessandrod (Author):

> We see this on mainnet with the current 10s timeout value. The stream can time out due to network conditions (client, in between, server).

We want to time out garbage connections as soon as possible; we're trying to build a chain that goes fast. If you're sending txs from your dialup at home, you should get disconnected.

Also, this is a timeout for one single stream; a connection can still send other streams. So in case of legitimate packet loss (one or a few transactions), we wait up to 1s, otherwise we move on and keep reading other transactions from the same peer, who will receive a reset stream error and presumably retry the failed transactions.

What you're seeing in those metrics is probably peers timing out because the whole connection times out, and then because we're currently extremely bad and read many streams in parallel, we increment the metric many times.

So tldr: on legit packet loss, 1s seems more than enough (assuming decent RTT <=200ms). When a peer falls off the planet and the connection just stops sending any data, timing out as soon as possible is the best thing we can do to allocate resources to peers who are actually sending transactions.

> I also wonder why the cancel mechanism you implemented here cannot be used.

I will let you role-play as a hacker for one day to answer that question!

I am not comfortable with such a big change to the timeout value. Can we set it to 2s for now as an interim approach and observe its impact in production?

alessandrod added 2 commits:

Use a cancellation token instead of polling an Arc<bool> to exit
connection tasks. Before this change we used to create a timeout future
for each tx, and immediately cancel it since virtually all connections
always have incoming tx pending. Cancelling timeout futures is expensive
as it goes inside the tokio driver and takes a mutex. On high load (1M
tps), cancelling timeout futures takes 8% (!) of run time.

With this change we create a cancellation token _once_ when a connection
is established and that's it - no overhead after that.

Don't wait 10s for one piece of a transaction to come before timing out.
Lower the timeout to 2s, after which we assume a stream is dead. 2s is
enough round trips to account for non catastrophic packet loss. If
packet loss is causing >2s stream latency, the connection is hosed and
the best thing we can do is save server resources for peers with better
connectivity.

@KirillLykov:

In CI test_quic_server_multiple_connections_on_single_client_endpoint failed with message:

thread 'nonblocking::quic::test::test_quic_server_multiple_connections_on_single_client_endpoint' panicked at streamer/src/nonblocking/quic.rs:1880:9:
assertion `left == right` failed
  left: 0
 right: 2

We check that the number of closed connections is 2 while it is 0 here

@alessandrod (Author):

> In CI test_quic_server_multiple_connections_on_single_client_endpoint failed with message:
>
> thread 'nonblocking::quic::test::test_quic_server_multiple_connections_on_single_client_endpoint' panicked at streamer/src/nonblocking/quic.rs:1880:9:
> assertion `left == right` failed
>   left: 0
>  right: 2
>
> We check that the number of closed connections is 2 while it is 0 here

From the code I removed:

// The min is to guard against a value too small which can wake up unnecessarily
// frequently and wasting CPU cycles. The max guard against waiting for too long
// which delay exit and cause some test failures when the timeout value is large.
// Within this value, the heuristic is to wake up 10 times to check for exit
// for the set timeout if there are no data.

This test is racy and was relying on sleep time being a lot larger than timeout time... I'll fix it properly tomorrow.
