Note about IPA wire format #65
# Interoperable Private Attribution wire format

This document clarifies the format IPA parties use to submit queries.

## Query

An IPA query consists of a mix of source and trigger events obtained from one or more source and trigger websites.
**Query size** is determined by the number of events included within a single query request.

It is desirable for report collectors to submit large queries, as larger queries bring more utility and save cost;
it therefore makes sense to optimize the query format on the wire.

The following sections propose a format that is space-optimized at the expense of being more complicated to assemble.
## Assumptions

* Report collectors use HTTP over TLS to send queries to helper party networks.
* The number of events within a single query is between $10^6$ and $10^9$.
* The number of unique source and trigger websites is significantly lower than the total number of events in the input set.

> **Review thread** (on the HTTP over TLS assumption):
>
> * Why are we choosing HTTP instead of TLS here?
> * "Over" here means that one protocol layer (HTTP) runs on top of the other (TLS). It doesn't mean "instead of"; even though that is another meaning "over" can take, it isn't the usual assumption in this context.
> * OK, maybe we should reword it to say "on top of" TLS to avoid confusion?
> * I never read it that way, too used to this expression. I agree that it does sound confusing, so I'll just use HTTPS instead.
> * "HTTP over TLS" is the name of the protocol. Or "HTTPS". To call it something else would be far worse.
> * Didn't know about that. How about we add a link to the RFC for it: https://www.rfc-editor.org/rfc/rfc2818
## Format considerations

It is worth looking at a single event first. To be concrete, this assumes the match key to be a 40-bit byte string,
but this does not change the fundamentals. It is described in more detail
[here](https://github.com/patcg-individual-drafts/ipa/blob/main/IPA-End-to-End.md#generating-source-and-trigger-reports-by-the-report-collector).

A single event consists of encrypted replicated shares of the match key, replicated shares of event data, and additional
authenticated data. Event data varies depending on whether it is a source or a trigger event. The authenticated data is used
to decrypt the shares of the match key.

The two things that consume the most space on the wire are the **site registrable domain** and the **match key provider origin**.
Both are ASCII strings, potentially large, that each event must refer to in order for helper parties to correctly obtain
the plain text shares.

The biggest savings from the custom format come from making each query carry only one copy of each unique site domain
and match key provider origin string. This proposal suggests building two lookup tables (one for each entity) on the
caller side.
> **Review thread** (on the lookup table paragraph):
>
> * Incomplete.
> * Given that source queries have a single source site and trigger queries have a single trigger site, you could start by indicating the type of query (which can be implicit or part of the query creation step). Then you can have two tables: one for the "same" side (source configurations for source queries, trigger configurations for trigger queries) and one for the "other" side (the converse). Then you could concatenate the two tables, index rows starting from zero, and refer to the configurations. Each row in the table would then effectively be a configuration that lists:
>   There are three implied values that fill out the common stuff:
>   Indexing into this table shouldn't take too many bytes, but I don't think that 1 byte is going to work out in all cases. The table size is known before you start processing individual items, though, so we can make the index size depend on the table size.
## Proposed format

Each query request must carry the following information:
```text
[match key][event data][authenticated data][match key][event data][authenticated data]...
```

This proposal suggests splitting query requests into two sections: a lookup table section for match key providers and
site domains, and a payload section with encrypted replicated shares of the match key and event data. Each encryption is annotated
with a unique id pair that points to the site and match key provider strings that are used to authenticate the match key
encryption.

```text
.lookups
[site origin 1][site origin 2]..[site origin N]
[match key provider 1][match key provider 2]..[match key provider M]
.payload
[site origin id][match key provider id][authenticated data][match key][event data]
[site origin id][match key provider id][authenticated data][match key][event data]
...
```

where `id` is the index of the site origin or match key provider inside the lookup table.

It is natural to assume $M \ll N$, so fewer bits are required to encode the match key provider index.

The total number of site domain entries inside the lookup table must be less than $2^{32}$.
The total number of match key provider origin entries inside the lookup table must be less than $2^8$.
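
For illustration, a decoded query request could be represented in memory along the following lines. This is a hypothetical sketch, not the actual IPA implementation types; only the index widths follow the limits above.

```rust
/// Hypothetical in-memory view of a decoded query request.
struct QueryRequest {
    /// `.lookups` section: fewer than 2^32 site origins and
    /// fewer than 2^8 match key provider origins.
    site_origins: Vec<String>,        // referenced by a 4-byte id
    match_key_providers: Vec<String>, // referenced by a 1-byte id
    /// `.payload` section: one record per event.
    events: Vec<EventRecord>,
}

/// One `.payload` record, before decryption.
struct EventRecord {
    site_origin_id: u32,          // index into `site_origins`
    match_key_provider_id: u8,    // index into `match_key_providers`
    authenticated_data: Vec<u8>,
    encrypted_match_key: Vec<u8>,
    event_data: Vec<u8>,
}
```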

### Metadata

Every query request must specify several parameters to the helper parties that impact the size of the payload.
These parameters are sent in the [header](https://www.rfc-editor.org/rfc/rfc9110.html#name-header-fields) of the HTTP request.

The list of supported parameters includes:
| Header name       | Type                         | Description                            | Accepted values | Default? | Mandatory? |
|-------------------|------------------------------|----------------------------------------|-----------------|----------|------------|
| `x-ipa-field`     | US-ASCII encoded string      | Field type used to secret-share values | `fp32`          | No       | Yes        |
| `x-ipa-query`     | US-ASCII encoded string      | Desired query to run in MPC            | `ipa`           | `ipa`    | No         |
| `x-ipa-version`   | single byte unsigned integer | Version of the request                 | `1`             | No       | Yes        |
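
For illustration only, a query request carrying these header fields might look like the following; the request target, host, and content type are placeholders rather than part of this proposal:

```text
POST /query HTTP/1.1
Host: helper.example
Content-Type: application/octet-stream
x-ipa-version: 1
x-ipa-field: fp32
x-ipa-query: ipa

<.lookups and .payload sections follow as the request body>
```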

> **Review thread** (on the header table):
>
> * RFC 6648. We need a format for creating a query, which needs to include these values somehow. I would not use header fields for this, but instead define a payload format. This doesn't need to be tightly packed, so JSON is probably where I would go. Also, some of this is information that could be part of the resource identity; that is, you would have one URI that does IPA and another that does something different. That means that you don't need to include explicit versioning. Parameters are only necessary if you think that something needs tuning, or there are things that need to be known in order to accept the query. I think that we should directly signal the query size in this request, as that has a direct bearing on what is being requested. IPA already has a bunch of parameters that we have built into our implementation:
>   These are what I would expect to see in the request that creates a query.
### Lookup table

The lookup table consists of length-prefixed unique site domain values and unique match key provider origins, encoded as
ASCII strings. Each section is terminated with a single zero-valued byte.

For example, a query that has two unique site origins and one match key provider will have the lookup table encoded as
follows:

```text
15www.example.com7docs.rs\016matchkeyprovider
```

All entries are implicitly zero-indexed, and the unique index of each entry is used inside the payload to indicate the
site origin all events within that group are associated with.

In the example above, `www.example.com` would have index 0 and `docs.rs` would be associated with index 1.
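
A minimal sketch of one way to produce this encoding, assuming each entry carries a single-byte length prefix (the exact prefix width is not pinned down by this document) and each section ends with a zero-valued byte:

```rust
/// Encodes one lookup section: each entry is prefixed with its length,
/// and the section is terminated with a single zero-valued byte.
/// Assumes non-empty ASCII entries shorter than 256 bytes, so the
/// zero terminator can never be confused with a length prefix.
fn encode_lookup_section(entries: &[&str]) -> Vec<u8> {
    let mut out = Vec::new();
    for entry in entries {
        assert!(!entry.is_empty() && entry.len() < 256 && entry.is_ascii());
        out.push(entry.len() as u8);
        out.extend_from_slice(entry.as_bytes());
    }
    out.push(0); // section terminator
    out
}

/// Builds the `.lookups` section from the example above:
/// two site origins followed by one match key provider origin.
fn build_example_lookups() -> Vec<u8> {
    let mut lookups = encode_lookup_section(&["www.example.com", "docs.rs"]);
    lookups.extend(encode_lookup_section(&["matchkeyprovider"]));
    lookups
}
```

Because a length prefix is never zero for a non-empty entry, the zero-valued terminator remains unambiguous.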

### Payload

The payload section carries the match key encryption and event data, along with additional authenticated data not included
in the lookup section. It includes one or more events, each encoded as follows (a sketch of one possible layout follows the list):

1) The unique index of the site origin for this event, encoded as a four-byte integer in big-endian byte order.
   This index must be unique inside the payload group and be a valid index into the lookup table.
2) The unique index of the match key provider origin for this event, encoded as a single-byte integer.
3) The single-byte key identifier from the key configuration for the helper party.
4) The current epoch, encoded as a two-byte integer in big-endian byte order.
5) The [HPKE](https://datatracker.ietf.org/doc/html/rfc9180) encryption of the replicated match key shares:
   1) The 32-byte [encapsulated key](https://datatracker.ietf.org/doc/html/rfc9180#section-4), encoded in big-endian byte order.
   2) The ciphertext of the match key shares, encoded in big-endian order. Using 40-bit match keys results in an 80-bit ciphertext, etc.
   3) The 16-byte authentication tag, encoded in big-endian byte order.
6) The timestamp of the event, encoded as a three-byte integer in big-endian byte order. The timestamp is represented as
   the number of seconds since the beginning of the epoch.
7) The secret-shared value of the trigger bit, encoded as two [field](#metadata) values in big-endian order.
8) The secret-shared value of the trigger value, encoded as two [field](#metadata) values in big-endian order.
9) The secret-shared value of the breakdown key, encoded as two [field](#metadata) values in big-endian order.
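
The per-event layout above can be summarized with the following sketch. It assumes 40-bit match keys and the `fp32` field (four bytes per field value); the type and field names are illustrative and not taken from any implementation.

```rust
/// One event in the `.payload` section, laid out as described above.
struct PayloadEvent {
    site_origin_id: u32,            // 1) four bytes, big-endian
    match_key_provider_id: u8,      // 2) one byte
    key_id: u8,                     // 3) one byte
    epoch: u16,                     // 4) two bytes, big-endian
    encapsulated_key: [u8; 32],     // 5.1) HPKE encapsulated key
    match_key_ciphertext: [u8; 10], // 5.2) two replicated 40-bit shares
    auth_tag: [u8; 16],             // 5.3) HPKE authentication tag
    timestamp: [u8; 3],             // 6) seconds since the start of the epoch
    is_trigger: [u32; 2],           // 7) two fp32 shares
    trigger_value: [u32; 2],        // 8) two fp32 shares
    breakdown_key: [u32; 2],        // 9) two fp32 shares
}

impl PayloadEvent {
    /// Serializes one event in the order given by the list above.
    fn encode(&self, out: &mut Vec<u8>) {
        out.extend_from_slice(&self.site_origin_id.to_be_bytes());
        out.push(self.match_key_provider_id);
        out.push(self.key_id);
        out.extend_from_slice(&self.epoch.to_be_bytes());
        out.extend_from_slice(&self.encapsulated_key);
        out.extend_from_slice(&self.match_key_ciphertext);
        out.extend_from_slice(&self.auth_tag);
        out.extend_from_slice(&self.timestamp);
        // 7)-9): each secret-shared value is two big-endian field values.
        for share in self
            .is_trigger
            .iter()
            .chain(&self.trigger_value)
            .chain(&self.breakdown_key)
        {
            out.extend_from_slice(&share.to_be_bytes());
        }
    }
}
```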

## Simulation

It's worth simulating various distributions of events between unique site origins to estimate the potential savings on the wire.
Intuitively, the biggest gains can be achieved when relatively few site origins have a large number of events associated with them.

Note: the following estimations ignore the TCP/IP/Ethernet frame overhead, as it remains the same regardless of the format chosen
by the implementations.

The following simulation assumes that each event takes **112 bytes** on the wire, including encryption overhead
(see [encryption](.encryption.md)), and that each site origin is a random 25-160 byte ASCII string. The overhead of sending
additional authenticated data is ignored, except for the site domain. The assumption is that the match key provider set per query
is small, and while an additional lookup table is warranted, its relative overhead won't be visible in the simulations.

> **Review thread** (on the 112-byte event size):
>
> * I think that you need to spell out assumptions here. Something like:
>   That's a little more than you have. But with a table, and if we make breakdown key and trigger value mutually exclusive (and the same size), then we have 68 bytes, plus the table size, which is trivial for a large data set. The other thing with tables is that you can reuse them...
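
The numbers below can be approximated with arithmetic along these lines. This is a rough sketch under the stated assumptions: it folds the per-event index overhead into the 112-byte figure and will not reproduce the tables exactly.

```rust
/// Estimated wire size, in bytes, without the lookup table optimization:
/// every event carries its site origin string inline.
fn unoptimized_size(events: u64, avg_origin_len: u64) -> u64 {
    events * (112 + avg_origin_len)
}

/// Estimated wire size, in bytes, with the lookup table optimization:
/// each unique site origin string is sent once and referenced by index.
fn optimized_size(events: u64, unique_origins: u64, avg_origin_len: u64) -> u64 {
    events * 112 + unique_origins * avg_origin_len
}
```

With an average origin length of about 92 bytes, `unoptimized_size(1_000_000, 92)` is roughly 194 MiB and `optimized_size(1_000_000, 10_000, 92)` is roughly 107.7 MiB, in line with the 1M table below.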

### 1M input

When the input size is 1M events, the total size without any optimizations is **194 MiB**.

| Unique site origins | Optimized size |
| --- | --- |
| 1 M | 194.5 MiB |
| 500 k | 150.7 MiB |
| 250 k | 128.7 MiB |
| 100 k | 115.6 MiB |
| 50 k | 111.2 MiB |
| 20 k | 108.6 MiB |
| 10 k | 107.7 MiB |

### 1B input

With 1 billion events, the savings between 10 k and 1 M unique site origins become marginal.
Without any optimization, 1B events will take **190 GiB**.

| Unique site origins | Optimized size |
| --- | --- |
| 500 M | 147.2 GiB |
| 100 M | 112.9 GiB |
| 10 M | 105.2 GiB |
| 1 M | 104.4 GiB |
| 500 k | 104.4 GiB |
| 250 k | 104.3 GiB |
| 100 k | 104.3 GiB |
| 50 k | 104.3 GiB |
| 20 k | 104.3 GiB |
| 10 k | 104.3 GiB |

Space gains vary from 30% to 50%, assuming the number of unique websites lies between 10% and 30% of the total number of events.