Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix send buffer misalignment issue #1307

Closed
wants to merge 5 commits into from

Conversation

akoshelev
Copy link
Collaborator

@akoshelev akoshelev commented Sep 23, 2024

The side-effect of this issue is described in #1300. The issue was how Gateway set up the read size compared to total capacity of a single buffer.

As a reminder, there are three configurable values inside each send buffer:

  • Total capacity, bytes. This describes how wide the send buffer is and has a direct relationship to active work. Total capacity must be able to accommodate all in-flight items inside the buffer.
  • Record size, bytes. How many bytes a single record takes on this buffer. Every send buffer is homogeneous, i.e. records must be of the same size.
  • Read size, bytes. How many bytes are taken off this buffer by the network layer at once. It waits until there are at least that many bytes filled up from the head of the buffer, before taking them off and releasing this capacity.

What was missing before this change is that read size must be aligned with total capacity. If it is not, then buffer can be left half-full, which was happening in #1300.

The failure scenario was setting up the record size to 512 (record * vectorization $$32 \times 16$$) and total capacity of 10 items (5120 bytes) with default read size of 2048 bytes.

Now, total capacity in this case is $$2.5\times 2048$$ and it leads to a deadlock. First 8 records get sent, but the remaining two get stuck.

Why was it working before?

Before DP, active work for small inputs was set to the input size and it remains unchanged for the duration of the query. So record 10 closes the channel and triggers a flush. With DP, we add more records, so the total number of items to process is greater than 10.

The fix

We enforce read size to be aligned with total capacity. In some cases it leads to smaller batches sent down to HTTP, but it should impact small runs only

Copy link

codecov bot commented Sep 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93.46%. Comparing base (fbce8ec) to head (ef2ba94).
Report is 36 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1307      +/-   ##
==========================================
- Coverage   93.48%   93.46%   -0.02%     
==========================================
  Files         209      207       -2     
  Lines       33823    33518     -305     
==========================================
- Hits        31619    31328     -291     
+ Misses       2204     2190      -14     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

// we must adjust read size. Adjusting total capacity is not possible due to
// possible deadlocks - it must be strictly aligned with active work.
// read size goes in `record_size` increments to keep it aligned.
// rem is aligned with both capacity and read_size, so subtracting
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is "rem"?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be record_size

Comment on lines 283 to 300
// it will keep read_size and capacity aligned
// Here is an example how it may work:
// lets say the active work is set to 10, record size is 512 bytes
// and read size in gateway config is set to 2048 bytes (default value).
// the math above will compute total_capacity to 5120 bytes and
// proposed_read_size to 2048 because it is aligned with 512 record size.
// Now, if we don't adjust then we have an issue as 5120 % 2048 = 1024 != 0.
// Keeping read size like this will cause a deadlock, so we adjust it to
// 1024.
proposed_read_size - min_capacity % proposed_read_size
};

// total capacity must be a multiple of both read size and record size.
// Record size is easy to justify: misalignment here leads to either waste of memory
// or deadlock on the last write. Aligning read size and total capacity
// has the same reasoning behind it: reading less than total capacity
// can leave the last chunk half-written and backpressure from active work
// preventing the protocols to make further progress.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is too complicated. I worry that this code will become unmaintainable as this is so impenetrable that people simply will not understand it. I certainly don't understand it and we discussed it in person today and I've read this twice.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe the explanation is overly complicated here. All we need to say here is that all 3 parameters need to be aligned. read_size is a multiple of record_size and total_capacity is a multiple of read_size

debug_assert_eq!(0, this.total_capacity.get() % this.record_size.get());
debug_assert!(this.total_capacity.get() >= this.read_size.get());
debug_assert!(this.total_capacity.get() >= this.record_size.get());
debug_assert!(this.read_size.get() >= this.record_size.get());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the comments above, I think you also want to check that:

debug_assert_eq!(0, this.read_size.get() % this.record_size.get());

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And there is another condition that the comments seem to suggest we should check, which is:

debug_assert!(this.total_capacity.get() >= gateway_config.active.get() * record_size);

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the addition of these two checks, I think this set of checks are equivalent to:

  1. read_size = a * record_size
  2. total_capacity = b * read_size
  3. a * b >= active_work

Where a and b are positive integers >= 1.

Can we just find a and b like this:

if total_records.is_indeterminate() || gateway_config.read_size.get() <= record_size {
    a = 1;
    b = active_work;
} else {
    a = gateway_config.read_size.get() / record_size;
    b = ceil(active_work / a);
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

record_size = 128
active_work = 33
a = 2048 / 128 = 16
b = ceil(33 / 16) = 3
read_size = a * record_size = 2048
total_capacity = b * read_size = 6144

The read size (2048) does not divide the active_work-determined capacity (4224), so this will hang.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, any prime number for active work makes this broken

total_capacity,
// closest multiple of record_size to read_size
// closest multiple of record_size to read_size
let proposed_read_size = min(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure this isn't supposed to be max?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it probably shouldn't. If read_size goes above capacity, then we will never read anything from that buffer because even when it is 100% full, read_size is still larger, so HTTP will back off

Comment on lines +447 to +450
send_config::<StdArray<BA256, 16>, 10, 2048>(TotalRecords::specified(43).unwrap());

assert_eq!(0, config.total_capacity.get() % config.read_size.get());
assert_eq!(config.total_capacity.get(), 10 * config.record_size.get());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the number 10 on line 447 and line 450 are meant to be kept in sync, can you use a constant for them?

Comment on lines +472 to +483
// total capacity checks
assert!(config.total_capacity.get() > 0);
assert!(config.total_capacity.get() >= record_size);
assert!(config.total_capacity.get() <= record_size * active);
assert!(config.total_capacity.get() >= config.read_size.get());
assert_eq!(0, config.total_capacity.get() % config.record_size.get());

// read size checks
assert!(config.read_size.get() > 0);
assert!(config.read_size.get() >= config.record_size.get());
assert_eq!(0, config.total_capacity.get() % config.read_size.get());
assert_eq!(0, config.read_size.get() % config.record_size.get());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are duplicated between the test and the actual code. I don't think there is a need for this duplication. Just leaving them in the function seems sufficient.

@andyleiserson
Copy link
Collaborator

Why does read_size need to be a multiple of the record size? Is it possible that read_size dividing the active_work-determined capacity was always the more important property? It does seem like having read_size be a multiple of the record size is nice, because then we don't leave small pieces of data in the outbound buffer. But as long as we don't leave a small piece of data in the buffer adjacent to an active_work boundary, are the small pieces of data are a problem?

In any case, I'm not sure we can get what we need without factoring active_work, or doing something like restricting active_work to a power of two so we can readily pick factors.

// has the same reasoning behind it: reading less than total capacity
// can leave the last chunk half-written and backpressure from active work
// preventing the protocols to make further progress.
let total_capacity = min_capacity / read_size * read_size;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

record_size = 128
active_work = 33
min_capacity = 4224
proposed_read_size = 2048
read_size = 2048 - 128 = 1920
total_capacity = 3840

@akoshelev
Copy link
Collaborator Author

akoshelev commented Sep 26, 2024

I am going to abandon this PR as the change we want to make is very different. @andyleiserson and I debated several options and here are our conclusions:

  • It is very hard to reason about tuning 3 different numbers (read size, active work and batch size) and keep them all aligned it not easy either. tests from Make active_work match records_per_batch #1316 showed that if at least one of them is not aligned with 2 others, we run into a deadlock
  • We prefer read size to be internal to network implementation. Protocol writers should not concern themselves with this number and irrespective of our choice for optimal buffer size for HTTP, it should just work.

Based on these, we think the optimal way forward would be to set active work to be strictly equal to ZKP batch size for malicious contexts. Also, we align read size with active work inside send buffer, allowing partial sends in rare cases. These are handled entirely at the network layer, so it shouldn't be a problem for protocol code.

This approach has a number of benefits. Firstly, it exposes only one knob to tune the active window for ZKP contexts, making it harder to run into a deadlock by writing wrong protocol code. Secondly, it keeps network stuff unexposed to the context layer, providing encapsulation and making it easy to tune/refactor later.

In order to simplify things even further, we force active work to be a power of two. This simplifies the alignment of read size

@akoshelev akoshelev closed this Oct 2, 2024
akoshelev added a commit to akoshelev/raw-ipa that referenced this pull request Oct 2, 2024
This is the second attempt to mitigate send buffer misalignment. Previous one (private-attribution#1307) didn't handle all the edge cases and was abandoned in favour of this PR.

What I believe makes this change work is the new requirement for active work to be a power of two. With this constraint, it is much easier to align the read size with it. Given that `total_capacity = active * record_size`, we can represent `read_size` as a multiple of `record_size` too:
`read_size = X * record_size`. If X is a power of two and active_work is a power of two, then they will always be aligned with each other.

For example, if active work is 16, read size is 10 bytes and record size is 3 bytes, then:

```
total_capacity = 16*3
read_size = X*3 (close to 10)
X = 2 (power of two that satisfies the requirement)
```

when picking up the read size, we are rounding down to avoid buffer overflows. In the example above, setting X=3 would make it closer to the desired read size, but it is greater than 10, so we pick 2 instead.
akoshelev added a commit to akoshelev/raw-ipa that referenced this pull request Oct 2, 2024
This is the second attempt to mitigate send buffer misalignment. Previous one (private-attribution#1307) didn't handle all the edge cases and was abandoned in favour of this PR.

What I believe makes this change work is the new requirement for active work to be a power of two. With this constraint, it is much easier to align the read size with it. Given that `total_capacity = active * record_size`, we can represent `read_size` as a multiple of `record_size` too:
`read_size = X * record_size`. If X is a power of two and active_work is a power of two, then they will always be aligned with each other.

For example, if active work is 16, read size is 10 bytes and record size is 3 bytes, then:

```
total_capacity = 16*3
read_size = X*3 (close to 10)
X = 2 (power of two that satisfies the requirement)
```

when picking up the read size, we are rounding down to avoid buffer overflows. In the example above, setting X=3 would make it closer to the desired read size, but it is greater than 10, so we pick 2 instead.
akoshelev added a commit to akoshelev/raw-ipa that referenced this pull request Oct 2, 2024
This is the second attempt to mitigate send buffer misalignment. Previous one (private-attribution#1307) didn't handle all the edge cases and was abandoned in favour of this PR.

What I believe makes this change work is the new requirement for active work to be a power of two. With this constraint, it is much easier to align the read size with it. Given that `total_capacity = active * record_size`, we can represent `read_size` as a multiple of `record_size` too:
`read_size = X * record_size`. If X is a power of two and active_work is a power of two, then they will always be aligned with each other.

For example, if active work is 16, read size is 10 bytes and record size is 3 bytes, then:

```
total_capacity = 16*3
read_size = X*3 (close to 10)
X = 2 (power of two that satisfies the requirement)
```

when picking up the read size, we are rounding down to avoid buffer overflows. In the example above, setting X=3 would make it closer to the desired read size, but it is greater than 10, so we pick 2 instead.
akoshelev added a commit to akoshelev/raw-ipa that referenced this pull request Oct 2, 2024
This is the second attempt to mitigate send buffer misalignment. Previous one (private-attribution#1307) didn't handle all the edge cases and was abandoned in favour of this PR.

What I believe makes this change work is the new requirement for active work to be a power of two. With this constraint, it is much easier to align the read size with it. Given that `total_capacity = active * record_size`, we can represent `read_size` as a multiple of `record_size` too:
`read_size = X * record_size`. If X is a power of two and active_work is a power of two, then they will always be aligned with each other.

For example, if active work is 16, read size is 10 bytes and record size is 3 bytes, then:

```
total_capacity = 16*3
read_size = X*3 (close to 10)
X = 2 (power of two that satisfies the requirement)
```

when picking up the read size, we are rounding down to avoid buffer overflows. In the example above, setting X=3 would make it closer to the desired read size, but it is greater than 10, so we pick 2 instead.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants