Authors:
- Jeffrey Yasskin (Google Chrome)
Participate:
Table of Contents
- Introduction
- Terminology
- Goals
- Non-goals
- Proposal: a
package:
scheme - Key scenarios
- Detailed design discussion
- Considered alternatives
- Stakeholder Feedback / Support / Opposition
- References & acknowledgements
A web bundle can be fetched from a URL, and it contains a set of HTTP responses identified by URL. This document explains how to name those component subresources directly from outside of the bundle and how those names interact with the other parts of the web platform API.
- A resource's Claimed URL: The URL that names the resource within a bundle.
- A resource's Bundle URL: The URL that can be fetched to retrieve the bundle that the resource appears inside of. As usual, the same sequence of bytes representing a bundle can be fetched from more than one URL.
- Distributor: the entity that serves a bundle. Identical to the owner of the Bundle URL's origin. This entity has the power to change the content of the bundle, including subresource named by URLs under other origins.
- Publisher: the owner of the Claimed URL's origin. There's no guarantee that the publisher had anything to do with the content in the bundle that's associated with their name.
When a bundle is used to serve a page's subresources, the simplest way to ensure the bundle is fetched before the subresource is to name the subresource relative to the bundle. That requires the subresources to have fully qualified names.
When the platform identifies a bundle subresource to some other entity, it must
give the other entity an accurate impression of who's responsible for the
subresource. This can show up in the URL bar, permission dialogs, the Referer
and Origin
headers, the postMessage()
targetOrigin
and event.origin
fields, etc.
When a bundle is used to serve a top-level page or an iframe, it often needs to store some data across reloads. This data needs to go in a storage shelf that's shared with the right set of other resources. It's easiest to see how storage needs to be partitioned with a series of examples.
Imagine that a bundle served at https://distributor.example/bundle.wbn
contains a subresource named https://foo.example/page.html
.
The bundle should use a different storage shelf from
https://distributor.example/page.html
.
This provides a way to create suborigins, which could be useful to sites like the Internet Archive. The Internet Archive currently serves copies of arbitrary websites from inside https://web.archive.org/, which means those websites can't be allowed to access Javascript storage lest they steal the main site's login cookies.
The subresource should use a different storage shelf from
https://foo.example/page.html
(the same subresource name) inside
https://distributor.example/otherbundle.wbn
.
- If an archive stores multiple versions of the same website in separate bundles, but those versions use storage differently, users couldn't easily try more than one version.
- If a user is relying on an application they have saved in one bundle, another website shouldn't be able to get access to that application's data just by creating a bundle that claims the same URLs.
The subresource should use a different storage shelf from
https://bar.example/page.html
(a different subresource origin) inside
https://distributor.example/bundle.wbn
(the same bundle).
This allows a single bundle, for example one created for El Paquete Semanal, to copy in many applications written by different authors that link to each other and that all use their own storage without conflicting with each other. This use case is also achievable by rewriting parts of the applications, so it should be the lowest priority in this section.
When a bundle is downloaded locally, the file path often contains the user's
name and may contain other sensitive information. This must not leak out onto
the network, either by being exposed inside the bundle which could make a
network request to send it, or by being included in
Origin
and
other request headers.
Even a network URL for a bundle has the chance of exposing private information in its path, which we need to be careful to avoid.
In order to eliminate cross-site tracking, the web is removing some cases that sent a path from one site to another. The URLs and origins of resources within packages shouldn't provide a new way to send paths between different sites.
It should be straightforward to save the resources in a site to a bundle and then run them from the bundle without changing the resources.
This document does not discuss how to establish that a response in a bundle is authoritative for its claimed URL. Other documents suggest using signatures, an online adoption protocol, being under the path of the bundle URL, or other techniques to establish authority, but the scheme here is agnostic to that choice.
We propose to define a new URL scheme, package:
, that encodes both the bundle
URL and the claimed URL into a single URL. See below for other reasonable
ways to do this encoding.
The package:
scheme avoids the //<authority>/
URL syntax to avoid expanding
the set of characters a URL parser needs to expect to find in an authority, but
the first component of the package:
path acts much like an authority in
picking out a storage shelf.
To put the right parts of the bundle and claimed URLs into that first path segment, the whole bundle URL and the part of the claimed URL before the path are percent-encoded.
For a bundle URL of https://distributor.example/package.wbn?q=query
and a claimed URL of https://claimed.example/path/page.html?q=query
, we get:
package:https%3a%2f%2fdistributor.example,package.wbn%3fq=query$https%3a%2f%2fpublisher.example/path/page.html?q=query
We define the origin for these URLs to consist of everything before the first /
:
package:https%3a%2f%2fdistributor.example,package.wbn%3fq=query$https%3a%2f%2fpublisher.example
These URLs and origins satisfy the goals of letting us directly address bundle subresouces and scoping storage correctly.
Operations within a single bundle should use the claimed URLs. This allows a bundle to act as an archive of a set of sites without breaking their internal assumptions.
This also helps prevent exposing private information: if a realm running inside a downloaded bundle could see the bundle URL and then make network requests, it could report the path to the bundle.
Inside the https://claimed.example/page.html
subresource of the
https://distributor.example/package.wbn
bundle:
expect(location.href).to.be("https://claimed.example/page.html");
expect(document.URL).to.be("https://claimed.example/page.html");
A navigation from package:bundle-url$claimed-url
to a target URL has two
referrer
policies:
the one delivered with the bundle and the one attached to the subresource inside
the bundle. The one delivered with the bundle needs to default to
strict-origin-when-cross-origin
(or stronger) as proposed in
whatwg/fetch#952, and then we apply
them both to the matching parts of the package:
URL.
We first apply the bundle's referrer policy to the source bundle URL, treating
it as same-origin with the target URL if the target URL is also a package:
URL
with a same-origin bundle URL.
If this drops the source URL's path, the referrer also doesn't include any part
of the claimed URL, yielding either no referrer
or package:bundle-origin
.
If applying the bundle's referrer policy includes the bundle URL's path, the
subresource's referrer policy is applied to the subresource's name. The target
is considered same-origin only if it's inside the same bundle and its claimed
URL is same-origin with the source's claimed URL.
If the subresource's computed referrer is no referrer
, but the bundle's
referrer is present, the Referer
string has the form
package:bundle-referrer
, with no $
marking a separation from the absent
subresource URL.
Generally a file's path shouldn't be included in the referrer given to a site
that's the target of a navigation from that
file. As files have opaque
origins,
they'll be cross-origin with any navigation target, and because of the referrer
policy of strict-origin-when-cross-origin
described above, they'll use the
serialization of the file's origin ("null") for navigations out of the same
bundle.
User agents strip some parts of the referrer on cross-site navigations in order to prevent tracking, and the above algorithm facilitates stripping the right parts of both the bundle URL and the claimed URL.
Specifically, if a site is trying to embed identifying information into a URL, it's equally easy to embed it in the path of the bundle URL or the origin (or path) of the claimed URLs. Since the above algorithm strips both of those with the same set of referrer policies, it achieves the goal of not creating cross-site messaging channels.
Within the same bundle, the bundle's URL isn't exposed, so
document.referrer
,
Request.referrer
,
and the Referer
header as exposed to a Service Worker's fetch
event have to show only the
filtered claimed URL when the referrer's bundle URL matches the realm's
bundle URL.
For example, if the true referrer is
package:https:,,distributor.example,package.wbn$https:,,claimed.example/page.html
, and the current document ispackage:https:,,distributor.example,package.wbn$https:,,claimed.example/page2.html
, thedocument.referrer
getter will notice the matching bundle URLs and return just the claimed URL out of the referrer. In this case,document.referrer
is:https://claimed.example/page.htmlOr if the current document is in a different bundle,
document.referrer
is:package:https:,,distributor.example,package.wbn$https:,,claimed.example/page.htmlIf the true referrer is
package:https:,,distributor.example,package.wbn
and the current document is in the same bundle,document.referrer
is the empty string.
To avoid adding exceptions to the above algorithm for
same-bundle navigations, there will be some cases where a same-bundle navigation
has no referrer but the same navigation between resources on the web would have
a referrer. Specifically, if the bundle is served with a referrer policy of
no-referrer
, origin
, or strict-origin
, the claimed URL is entirely
removed from the referrer, so regardless of the subresource's referrer policy,
document.referrer
will be the empty string.
When a resource inside a bundle makes a cross-origin network request, what
Origin
header should
be sent?
We propose the simple thing here: Compute the referrer that would be used for a navigation to the requested resource and send the origin of that referrer string, computed by dropping the path.
This omits Fetch's allowance for cors-mode and websocket requests to get the full origin even if the referrer policy would disallow it.
As with the referrer, when the Origin
header is observed within the same
bundle, most likely via service workers, the bundle part needs to be removed.
When sending a message, the sender
sets the
targetOrigin
field to determine which recipients can read it.
- When sending from outside the bundle to inside, or from inside one bundle to
inside a different bundle, the sender uses the full
package:bundle-url$claimed-origin
. - When sending from inside the bundle to inside, the sender uses just the claimed origin.
- When sending from inside the bundle to outside, the sender uses the same
target origin that a sender outside the bundle would use. This means that
package:a-bundle$https:,,source.example
uses the same target origin for sending to eitherhttps://target.example
orpackage:a-bundle$https:,,target.example
, which could let either receive messages meant for the other. This should be ok: things inside bundles are meant to be quotes of things outside, and if the entity who composed a bundle is dishonest about this, they can modify the sender as easily as its target.
When receiving a message, the recipient is told the sender's origin.
- When sending from inside the bundle to outside or from inside one bundle to
inside another bundle, the exposed value matches the
Origin
header we'd send for a request to the target resource. - When sending from outside the bundle to inside, the origin is the same as if the message were received outside a bundle.
- When sending from inside a bundle to inside the same bundle, the target is told the claimed origin of the source. This means there's no way to distinguish an intra-bundle message from a message that the browser can verify comes from its source, which should be ok. As for the target origin, if the bundle's composer wants to spoof a message, they could modify its recipient inside the bundle.
The current rendering advice in the URL
spec is not appropriate for the
default display of package:
URLs, as users won't understand the significance
of its "host" part,
https:,,distributor.example,otherpackage.wbn;q=query$https:,,publisher.example
.
We suggest that, in places the browser would render just a URL's host, it render
the host of the bundle URL, so just distributor.example
in the above
example. When the browser would render the full URL, it should show just the
bundle's URL with some indication that it's viewing just a piece of that bundle.
To edit the URL, the browser should allow the user to pick from the resources
contained inside the bundle instead of encouraging the user to edit the text of
the package:
URL.
How should a resource at a package:
URL be able to ask for permission to use
powerful web APIs? As with the URL bar, users should
generally see the bundle itself as requesting permission, since it's the
bundle's server that can control its content. However, a bundle server might
host user-provided content and have a better reputation than it wants to lend to
that content. As such, the bundle server needs to be able to choose whether
those permission requests can happen.
Permissions-Policy
may be a good way to express this. Specifically, we could treat bundled content
as cross-origin from its server, and only allow permission requests if the
server sends a Permissions-Policy: geolocation=(self "package:https://server.example")
header.
If the user downloads a bundle, by default it has a new bundle URL referring to its new location, and so it loses access to any storage it created while the user was using it online. This is undesirable, so user agents should take measures to avoid confusion and lost data. It's not clear yet what those measures should be:
- Store a mapping between the offline location and the online location, and
treat them as same-origin.
- This can't just be the offline bundle's Mark Of The Web, because if the user received the bundle on an SD card, that mark isn't trustworthy.
- Storing the offline location in a trusted place could still cause trouble if the user later mounts a less-trusted bundle there.
- The browser could safely store a hash of the offline bundle.
- Copy storage from the online bundle to the offline bundle when it's
downloaded.
- This could be confusing if the user continues updating the offline storage and then goes back to the online bundle.
The above design relies on the URL to both name a bundle
subresource and give its storage the right
scope. Instead, we could explicitly define that the
active storage shelf depends
on the bundle path to the resource (in addition to the frame path to the
window
).
Environment settings objects loaded from a bundle need to track the bundle regardless, so that fetches can look in the bundle before going to the network, so this isn't an extra piece of data to track.
Going this route makes same-bundle introspection easier, since the active URL of the document is the claimed URL. However, it makes it harder to ensure referrers and origins are correct since it avoids defining a syntax that can slot into the existing headers and fields. That could be positive—existing code won't be expecting a new scheme here, so one could cause compatibility issues—but also seems likely to cause confusion about which entity is acting.
There are several reasonable ways to encode the bundle URL and claimed URL into a single URL.
This is an open question, with only a small preference for the encoding described above.
It's easy to address a subresource of a bundle by putting the claimed URL into the fragment of the bundle URL. We can represent recursive fragments with percent-encoding or a key-value format for the fragments:
https://distributor.example/package.wbn?q=query#https://claimed.example/path/page.html?q=query%23fragment
https://distributor.example/package.wbn?q=query#url=https://claimed.example/path/page.html?q=query;fragment=fragment
(from the TAG's design)
This option doesn't immediately satisfy the goal to scope storage, but it could if we define a new scheme:
pkg+https://distributor.example/package.wbn?q=query#url=https://claimed.example/path/page.html?q=query;fragment=fragment
The downside here is that the origin computation has to take part of the fragment into account. This could work, with risks that:
- The fragment might sometimes get dropped by code that was written against the currently-correct assumption that the fragment doesn't affect the origin.
- Some code might only be looking at the host, which would fail to distinguish the bundle from its server.
The interpretation of a fragment depends on the MIME type of the resource it's
applied to. Because the bundle URL inside apkg+https
URL could refer to
several kinds of things with subresources, like ZIP, tar, 7z, etc. files, we
should find some way to provide a consistent fragment format for all of them.
The fact that the other formats only name their contents with paths and not full
URLs only slightly diminishes the
utility.
It would be slightly easier to read package:
URLs if we percent-encoded fewer
characters. Since /
and ?
are very likely to appear in the nested URLs, we
can replace them with ,
and ;
(for example) instead of percent-encoding
them. :
only really needs to be percent-encoded in ://
schemes where it
separates a port. We'd wind up with something like:
package:https:,,distributor.example,package.wbn;q=query$https:,,claimed.example/path/page.html?q=query
While none of the package:
URLs are readable by end-users (URLs in general
aren't readable by end-users), there's some benefit to making the URLs as
readable as possible for developers trying to figure out why their links are
broken. Minimizing the amount of percent-encoding helps with that.
Exactly how do we compose the package: URL?
The package:
URL contains two other URLs, which need to be encoded and
delimited so that they're clearly distinct. Because this involves characters
that would not appear in an https://
URL's authority, we choose to avoid the
generic URL's authority component, and instead use the package:<path>
form. To
clearly show the part of the URL that selects a storage shelf, we encode that
into the first path component. This leads to the following algorithm for
combining a bundle URL and a claimed URL into a package:
URL:
- Let the claimed URL prefix be the part of the claimed URL before its
path. For example, the prefix of
https://host/path
would behttps://host
, while the prefix ofurn:uuid:12345
would beurn:
. - Let the package percent encode set be the C0 control percent-encode
set and
,
,;
,$
, and%
. - Let the encoded bundle URL be the result of UTF-8-percent-encoding the bundle URL with the package percent encode set.
- Let the encoded claimed URL prefix be the result of UTF-8-percent-encoding the claimed URL prefix with the package percent encode set.
- In the encoded bundle URL and the encoded claimed URL prefix, replace
/
with,
and?
with;
. - Return the concatenation of:
package:
- The encoded bundle URL
$
- The encoded claimed URL prefix
- The path and query of the claimed URL.
To parse the package:
URL into a bundle URL and a claimed URL, we can use the following algorithm:
- Let package URL be the
parsed
package:
URL. - If package URL's path
doesn't have exactly 1 component, the URL is malformed. (The URL parser
doesn't parse out path segments for URLs without
://
.) - Split the only component of the package URL's path on the
$
into the encoded bundle URL and the encoded claimed URL. If it doesn't have a$
, it's malformed. - Split the encoded claimed URL into the encoded claimed URL prefix before
the first
/
if any, and the claimed URL path and query including the first/
and after. The claimed URL path and query is empty if there is no/
. - In both the encoded bundle URL and the encoded claimed URL prefix,
replace
,
with/
and;
with?
. - Let the bundle URL be the percent decoding of encoded bundle URL.
- Let the claimed URL be the concatenation of
- the percent decoding of the encoded claimed URL prefix with
- the claimed URL path and query.
- Return the bundle URL and the claimed URL.
We could reduce percent-encoding even more by only protecting the delimiter, yielding:
package:https://distributor.example/package.wbn?q=query$https://claimed.example/path/page.html?q=query
This is probably the second-most readable option behind using fragments. Like with the fragment, it makes it unclear which parts of the URL contribute to the storage key, but it doesn't incur the risk that fragments are dropped before they could contribute to the origin.
The primary use cases for package: URLs fetch the package with either an
https:
or file:
scheme. We could omit the package's scheme from the URL, and
infer it from whether the URL started with package://
or package:///
(whether the authority component is empty). So:
package://distributor.example/package.wbn?q=query$https://claimed.example/path/page.html?q=query
fetches its package from https://distributor.example/package.wbn?q=query
, and
package:///c:/Users/uname/Downloads/package.wbn$https://claimed.example/path/page.html?q=query
fetches its package from file:///c:/Users/uname/Downloads/package.wbn
.
draft-soilandreyes-arcp
and
draft-shur-pack-uri-scheme
propose schemes to address path-named components of other packaging formats.
pack:
proposes an encoding similar to the package:
scheme
here and could probably be extended to support
URL-named components.
When the source and target of a navigation aren't the same bundle, it would be
reasonable to omit all information about the bundle subresource from the
Referer
header, instead of only omitting that information when the referrer
policy omits the bundle's path. This would be simpler, but would strip analytics
information that sites are likely to find as useful as the path information we
allow in other cases.
For non-bundled resources, cors-mode and websocket requests send an Origin
header that's not modified by the referrer policy. It would be possible to send
something like package:bundle-origin
in those cases even if the bundle's
referrer policy is stricter.
In the other direction, we could cap the information in the Origin
header to
package:bundle-origin
, even if the referrer policy would allow more. We would
probably still have to send the whole origin for same-bundle requests.
- W3CTAG: No signals
- Browsers:
- Safari: No signals
- Firefox: Concerned about the impact on UI
- Samsung: No signals
- UC: No signals
- Opera: No signals
- Edge: No signals
- Web Developers : No signals
Many thanks for valuable feedback and advice from:
- Kinuko Yasuda
- Larry Masinter
- Mallory Knodel
- Martin Thomson
- Mike West
- Ryan Sleevi
- Ted Hardie
- The participants in the IETF WPACK Working Group
- Tsuyoshi Horo