Note about IPA wire format #65
# Interoperable Private Attribution wire format

This documents provides clarification on the format IPA parties use to submit queries.
Suggested change:
- This documents provides clarification on the format IPA parties use to submit queries.
+ This documents the format that report collectors use to submit IPA queries to helper party networks.

The biggest savings from the custom format come from making each query carry only one copy of each unique site domain
and match key provider origin string. This proposal suggests building two lookup tables (one for each entity) on the
caller site
incomplete
Given that source queries have a single source site and trigger queries have a single trigger site, you could start by indicating the type of query (which can be implicit or part of the query creation step), then you can have two tables: one for the "same" side (source configurations for source queries, trigger configurations for trigger queries) and one for the "other" side (the converse). Then you could concatenate the two tables, index rows starting from zero and refer to the configurations.
Each row in the table would then effectively be a configuration that lists:
- Site: length prefixed ASCII; 1 byte length. Optimization hack: length = 0 copies the previous value.
- Epoch: 2 bytes.
- Key identifier: 1 byte.
There are three implied values that fill out the common stuff:
- (implied) Event type is inferred from the table type, so this is effectively run-length encoded.
- (implied) The match key provider should be the same for all events, so that can be part of query configuration.
- (implied) The helper party should know its own name, so that can be omitted completely.
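The row layout described above can be sketched roughly as follows. This is only an illustration of the idea, not a proposed implementation: the function names `encode_table` and `decode_table` are hypothetical, and big-endian byte order for the 2-byte epoch is an assumption that the comment does not specify.

```python
import struct
from typing import List, Optional, Tuple

# One configuration row: (site, epoch, key identifier).
Row = Tuple[str, int, int]


def encode_table(rows: List[Row]) -> bytes:
    """Serialize rows as: 1-byte site length, site bytes, 2-byte epoch, 1-byte key id.

    A site length of 0 means "same site as the previous row".
    Byte order (big-endian) is an assumption made for this sketch.
    """
    out = bytearray()
    prev_site: Optional[str] = None
    for site, epoch, key_id in rows:
        site_bytes = site.encode("ascii")
        if not 1 <= len(site_bytes) <= 255:
            raise ValueError("site must be 1..255 ASCII bytes")
        if site == prev_site:
            out.append(0)  # optimization hack: copy the previous site
        else:
            out.append(len(site_bytes))
            out += site_bytes
        out += struct.pack(">HB", epoch, key_id)
        prev_site = site
    return bytes(out)


def decode_table(data: bytes) -> List[Row]:
    """Inverse of encode_table."""
    rows: List[Row] = []
    prev_site: Optional[str] = None
    pos = 0
    while pos < len(data):
        length = data[pos]
        pos += 1
        if length == 0:
            if prev_site is None:
                raise ValueError("first row cannot copy a previous site")
            site = prev_site
        else:
            site = data[pos:pos + length].decode("ascii")
            pos += length
        epoch, key_id = struct.unpack_from(">HB", data, pos)
        pos += 3
        rows.append((site, epoch, key_id))
        prev_site = site
    return rows
```

With this layout, a row that repeats the previous site costs 4 bytes instead of 4 plus the site length, which is where sorting rows by site would pay off.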
Indexing into this table shouldn't take too many bytes, but I don't think that 1 byte is going to work out in all cases. The table size is known before you start processing individual items, though, so we can make the index size based on the table size (
The list of supported parameters includes:

| Header name      | Type                         | Description                            | Accepted values | Default? | Mandatory? |
|------------------|------------------------------|----------------------------------------|-----------------|----------|------------|
| `x-ipa-field`    | US-ASCII encoded string      | Field type used to secret-share values | `fp32`          | No       | Yes        |
| `x-ipa-query`    | US-ASCII encoded string      | Desired query to run in MPC            | `ipa`           | `ipa`    | No         |
| `x-ipa-version`  | single byte unsigned integer | Version of the request                 | `1`             | No       | Yes        |
See RFC 6648, which deprecates the `X-` prefix for header field names.
We need a format for creating a query, which needs to include these values somehow. I would not use header fields for this, but instead define a payload format. This doesn't need to be tightly packed, so JSON is probably where I would go.
Also, some of this is information that could be part of the resource identity. That is, you would have one URI that does IPA and another that does something different. That means that you don't need to include explicit versioning.
Parameters are only necessary if you think that something needs tuning, or there are things that need to be known in order to accept the query. I think that we should directly signal the query size in this request as that has a direct bearing on what is being requested.
IPA already has a bunch of parameters that we have built into our implementation:
- The number of breakdown keys.
- The maximum value of individual trigger values.
- The per-user cap.
- The attribution window.
These are what I would expect to see in the request that creates a query.
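As a rough illustration of what such a JSON payload might look like, here is a minimal sketch. Every field name below (`query_size`, `breakdown_keys`, `max_trigger_value`, `per_user_cap`, `attribution_window_seconds`) is hypothetical, as is the choice of seconds for the attribution window; none of this is defined by the draft.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class QueryRequest:
    """Hypothetical body of the request that creates an IPA query."""

    query_size: int                  # number of events in the query
    breakdown_keys: int              # number of breakdown keys
    max_trigger_value: int           # maximum value of an individual trigger value
    per_user_cap: int                # per-user cap
    attribution_window_seconds: int  # attribution window (unit is an assumption)

    def __post_init__(self) -> None:
        # Reject obviously invalid parameters before they reach the helpers.
        for name, value in asdict(self).items():
            if not isinstance(value, int) or value <= 0:
                raise ValueError(f"{name} must be a positive integer")

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @staticmethod
    def from_json(body: str) -> "QueryRequest":
        # Unknown or missing fields raise TypeError from the constructor.
        return QueryRequest(**json.loads(body))
```

The query type and version are deliberately absent here, on the assumption that they are carried by the resource URI as suggested above.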
by the implementations.

The following simulation assumes each event to take **112 bytes** on the wire, including encryption overhead
(see [encryption](.encryption.md)) and site origin to be a random 25-160 byte ASCII string. The overhead of sending
Suggested change:
- (see [encryption](.encryption.md)) and site origin to be a random 25-160 byte ASCII string. The overhead of sending
+ (see [encryption](./encryption.md)) and site origin to be a random 25-160 byte ASCII string. The overhead of sending
Note: The following estimations ignore the TCP/IP/Ethernet frame overhead as it remains the same regardless of the format chosen
by the implementations.

The following simulation assumes each event to take **112 bytes** on the wire, including encryption overhead
I think that you need to spell out assumptions here. Something like:
- enc = 32
- ciphertext = 2*40/8 = 10
- tag = 16
- site = 1 + 50 (say)
- key id = 1
- epoch = 2
- breakdown key = 1 (assuming XOR shares here and a small space, not sure about state of the art)
- trigger value = 4
- ts = 4 (not sure here again)
That's a little more than you have.
But with a table, and if we make breakdown key and trigger value mutually exclusive (and the same size), then we have 68 bytes, plus the table size, which is trivial for a large data set.
The other thing with tables is that you can reuse them...
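To make the arithmetic above easy to check, here is a small sketch that adds up the listed field sizes. The inline sizes are taken from the list above. The table-based figure depends on two things the comment does not state, so both are labeled assumptions: a 4-byte slot shared by breakdown key and trigger value, and a 2-byte table index. Those choices happen to reproduce the 68-byte figure, but the comment may have had a different breakdown in mind.

```python
# Per-event field sizes in bytes, as listed in the comment above.
INLINE_FIELDS = {
    "enc": 32,
    "ciphertext": 2 * 40 // 8,  # two 40-bit shares
    "tag": 16,
    "site": 1 + 50,             # 1-byte length prefix + ~50-byte site
    "key_id": 1,
    "epoch": 2,
    "breakdown_key": 1,
    "trigger_value": 4,
    "timestamp": 4,
}

# Fields that move out of each event and into a shared table row.
TABLE_ROW_FIELDS = ("site", "key_id", "epoch")


def inline_event_size() -> int:
    """Bytes per event when every field is carried inline."""
    return sum(INLINE_FIELDS.values())


def table_event_size(index_bytes: int, value_bytes: int = 4) -> int:
    """Bytes per event when site, key id and epoch are replaced by a table index.

    ASSUMPTION: breakdown key and trigger value share one `value_bytes` slot,
    and `index_bytes` is chosen by the caller; neither is fixed by the comment.
    """
    fixed = sum(INLINE_FIELDS[f] for f in ("enc", "ciphertext", "tag", "timestamp"))
    return fixed + value_bytes + index_bytes


def total_table_bytes(n_events: int, n_rows: int, index_bytes: int) -> int:
    """Total bytes for a query: per-event records plus the table itself."""
    row_size = sum(INLINE_FIELDS[f] for f in TABLE_ROW_FIELDS)
    return n_events * table_event_size(index_bytes) + n_rows * row_size
```

For a large query the table term is small next to the per-event term, which is the sense in which the table size is "trivial for a large data set".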

## Assumptions

* Report collectors use HTTP over TLS to send queries to helper party networks.
Why are we choosing http instead of tls here?
"over" here means that the one protocol layer (HTTP) runs on top of the other (TLS). It doesn't mean "instead of", even though that is another meaning that "over" can take, it isn't the usual assumption in this context.
oh ok.. maybe we should reword it to say "on top of" TLS to avoid confusion?
heh, I never read it that way - too used to this expression. I agree that it does sound confusing, so I'll just use HTTPS instead.
"HTTP over TLS" is the name of the protocol. Or "HTTPS". To call it something else would be far worse.
Didn't know about that. How about we add a link to the RFC for it: https://www.rfc-editor.org/rfc/rfc2818
We're getting close to running an in-market test, so it is worth debating what the input format for IPA should look like. Because a lot of information is included in the AAD for each event, it may be possible to get significant savings on the wire (10-30%), given some assumptions about the input ($N_{sites} \ll N_{events}$).