
Open questions about RWS #285

Closed
thezedwards opened this issue Feb 22, 2024 · 6 comments


@thezedwards
Contributor

Howdy. After this article was published (https://digiday.com/media/wtf-are-related-website-sets-in-googles-privacy-sandbox/), it raised questions for me about what is allowed under RWS, as I haven't seen specific clarifications in any documentation. If anyone is able to answer any or all of these, that would be helpful:

1. Are organizations allowed to submit multiple RWS submissions? The current 5-domain cap on "associated domains" makes it impossible for normal-sized publisher groups to use this feature. Is G suggesting that organizations with, let's say, 25 publisher domains in their network submit 5 unique RWS submissions and group their sites based on something? Can you clarify grouping strategies, or why this is still capped at 5 even though many organizations have tried to submit pull requests with over 5 associated domains?

2. If submitting multiple RWS submissions (5 submissions with 5 sites each, 25 sites total), can you reuse the same "service domains" across each of these submissions? My understanding was that a service domain could NOT be in multiple submissions (as this creates obvious cross-domain tracking possibilities), but if that's the case, how is a publisher group supposed to use the service-domains concept? Are you suggesting that if a publisher group has a service domain currently in use (like for login syncs), they need to clone that infrastructure across multiple domains? So this example publisher group with 25 domains and 5 RWS submissions would also need 5x of each service domain being used? So 5x their SSO infrastructure, with those services cloned across each domain? That sounds outrageously costly to pivot into without a bunch of bright red warning lights pointing this out to the industry. Some licensing products are domain-based, so suddenly requiring 5x licenses just to maintain your own SSO login infrastructure seems untenable, or at least a big cost increase.

3. Are there any "owned & operated" restrictions for RWS submissions, as seemingly alluded to at https://github.com/WICG/first-party-sets?tab=readme-ov-file#use-cases but with the opposite alluded to in the public content at https://digiday.com/media/wtf-are-related-website-sets-in-googles-privacy-sandbox/? Will it be possible for an SSP investment group to drop their logo in the footer of websites they partially own or have invested in (with the primary purpose being to control their ads.txt/app-ads.txt and share revenue from it, while not operating any of the stories or writing) - publisher-investment schemes which are becoming increasingly popular - and list all the sites they have invested in within 5-domain groupings in RWS? Or, if the 5-domain cap is lifted, to submit all the domains they have invested in? Are there any explicit "owned & operated" frameworks to help folks navigate what is allowed now and what could get a domain removed in the future?

Thanks for any feedback on this,

@krgovind
Collaborator

Hi @thezedwards, thanks for the questions.

Are organizations allowed to submit multiple RWS submissions?

Yes, organizations are allowed to submit multiple sets, as long as the sets are disjoint from each other. It might make the most sense to group together sites that share a common user journey/functionality across them. The reason the limit is capped at 5 for associated sites is explained here.

If submitting multiple RWS submissions - 5 submissions w/ 5 sites each, total 25 sites - can you reuse the same "Service Domains" across each of these submissions?

No, sets need to be disjoint; therefore it is not possible to reuse the same service sites across multiple sets.
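The disjointness rule stated above can be sketched as a simple check (a hypothetical helper with made-up domains, not part of any official RWS tooling): every domain in a set, whether primary, associated, or service, may appear in at most one set.

```python
# Hypothetical sketch of the "sets must be disjoint" rule:
# no domain (primary, associated, or service) may appear in more than one set.

def all_domains(rws_set):
    """Flatten one RWS set into the full list of member domains."""
    return ([rws_set["primary"]]
            + rws_set.get("associatedSites", [])
            + rws_set.get("serviceSites", []))

def sets_are_disjoint(sets):
    """Return True if no domain appears in more than one set."""
    seen = set()
    for s in sets:
        for domain in all_domains(s):
            if domain in seen:
                return False
            seen.add(domain)
    return True

# Example: reusing the same service site across two sets fails the check.
set_a = {"primary": "https://pub-a.example",
         "associatedSites": ["https://a1.example"],
         "serviceSites": ["https://sso.example"]}
set_b = {"primary": "https://pub-b.example",
         "associatedSites": ["https://b1.example"],
         "serviceSites": ["https://sso.example"]}

assert not sets_are_disjoint([set_a, set_b])
```

This is why sharing one SSO service site across a publisher group's multiple sets is rejected: the shared domain would appear in more than one set.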

Are you suggesting that if a publisher group has a service domain currently in use (like for login syncs?) they need to clone that infrastructure across multiple domains?

In this case, we would recommend using the FedCM API to facilitate logins instead of RWS. We've also published guidance for login-specific use-cases here.

Are there any "owned & operated" restrictions for RWS submissions as seemingly alluded to @ https://github.com/WICG/first-party-sets?tab=readme-ov-file#use-cases

This is the canonical resource for set formation requirements. We currently require common ownership for service sites and ccTLD sites, but not for associated sites (which instead have the requirement that "affiliation with the set primary is clearly presented to users").

@thezedwards
Contributor Author

Thanks for the response, @krgovind. A few follow-up questions:

“Yes, organizations are allowed to submit multiple sets, as long as the sets are disjoint from each other. It might make the most sense to group together sites that share a common user journey/functionality across them. The reason the limit is capped at 5 for associated sites is explained here.”

It appears there are 2 organizations that currently have more than 5 associatedSites domains listed within the live RWS list at https://github.com/GoogleChrome/related-website-sets/blob/main/related_website_sets.JSON: the hearty[.]me record has 7 associatedSites, and the yandex-team[.]ru record has 9 associatedSites.

Can you explain what happens to organizations who submit more than 5 domains even though this is “capped at 5 associated sites”? I was under the impression these would not get approved, but seeing several examples with more than 5 domains means I want to understand why an organization would purposefully do that and why it’s still allowed for an organization to submit more than 5.

Is there a specific pop-up or user facing dialog box that is shown after an end user visits their 6th associatedSite listed within the live RWS list? Does that publisher still then get access to the 3rd party cookies if the user allows some permission on that 6th visit?

If different users visit different 5-domain groupings within a 9-domain associatedSites grouping, would they never see a browser prompt unless they somehow visited a 6th site?

I’m trying to understand whether these publishers are purposefully doing this, or whether the lack of warning / error during submission has led them to believe their sets are appropriately sized.

////
When it comes to organizations like Yandex, who submitted 9 associatedSites and had them approved: one of the associatedSites was the domain turbopages[.]org, which historically has had content on its homepage that may qualify as a disclosure of affiliation with the parent org (example at https://web.archive.org/web/20240112201145/https://www.turbopages.org/), but there is nothing on the live homepage right now except an error message, and it's been that way since around the submission time (https://www.turbopages.org/).

What is the policy for organizations who drop the affiliation from these associatedSites? How long can the affiliation stay missing before they are dropped from RWS lists? Is there any automated process to check for changes like this?
////

“it is not possible to reuse the same service sites across multiple sets.”

Is there anything preventing a publisher from creating subdomains on their serviceSite, CNAME mapped to 3rd party domains, for specific service support?

Is there any intent in the future to uncloak CNAME data flows / break any type of DNS architecture within a serviceSite?

Does G expect to see huge numbers of 3rd party subdomains CNAME mapped to serviceSite subdomains? Is this allowed?

Is it accurate that even though a 3rd party service could get CNAME mapped into serviceSite, due to CHIPS/cookie partitioning, all of the RWS sites within one submission are technically “one partition”? So even if a service organization had 50 separate partners who mapped them to CNAME records on a serviceSite, they would have 50 separate cookie partitions based on the number of domains within each partner's RWS associatedSites list, and not one giant partition across unrelated clients?

//

Thank you for the clarification regarding login persistence via the FedCM API – that doesn't seem to be deployed yet in Firefox or Safari. But I suppose any RWS customizations would have been unique to Chrome anyway.

//

“We currently require common ownership for service sites, and ccTLD sites; but not for associated sites (which have the requirement that "affiliation with the set primary is clearly presented to users").”

How is “common ownership” determined for service sites? Is this merely through adding the appropriate JSON records within the /.well-known/ directory on a primary domain referencing each serviceSite, and on the serviceSite referencing the primary domain?

There are examples live in the current RWS list like: "serviceSites": ["https://growthrx[.]in", "https://clmbtech[.]com", "https://tvid[.]in"], which are all technically owned by the primary submitting entity based on my research, and accurately listed in the primary domain's well-known RWS file at https://timesinternet.in/.well-known/related-website-set[.]json – but how was the shared ownership of the serviceSites confirmed via the submission process?
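For context, the .well-known file discussed here generally has this shape (a sketch with hypothetical domains, based on my reading of the RWS submission guidelines; see those guidelines for the authoritative format). The primary serves the full set declaration:

```json
{
  "primary": "https://publisher-group.example",
  "associatedSites": ["https://brand-one.example", "https://brand-two.example"],
  "serviceSites": ["https://sso.publisher-group.example"],
  "rationaleBySite": {
    "https://brand-one.example": "Affiliation shown in site footer",
    "https://brand-two.example": "Affiliation shown in site footer",
    "https://sso.publisher-group.example": "Single sign-on for the set"
  }
}
```

and each non-primary member serves its own /.well-known/related-website-set.json pointing back at the primary:

```json
{ "primary": "https://publisher-group.example" }
```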

What stops the primary entity from slipping in a 2nd domain they don’t own into the serviceSite list, which is technically owned by one of their vendors, and merely asking the vendor to add a corresponding record on their RWS json file? This seems to be the obvious place most ad tech vendors will want to be listed so they get mini cross-site cookie pools within a publisher group. Is that understood / expected / allowed?

thanks for any time on this and any responses you can share,

@thezedwards
Contributor Author

Hi @krgovind - is it possible for you or someone else on the Google team to answer these questions next week? We're coming up on 2+ weeks here and I'm trying to get clarity about some things that I've spoken about publicly. Thanks

@krgovind
Collaborator

Hi @thezedwards, I'm sorry. This got lost in my post-vacation backlog.

Can you explain what happens to organizations who submit more than 5 domains even though this is “capped at 5 associated sites”?

Indeed, these sets will be accepted via the GitHub process, but the browser (Chrome) will only apply our Storage Access API auto-granting rules to the first 5 domains and will ignore the remaining domains. Within Chrome, those additional domains will be treated as if they were never part of the set. The reason we chose to accept these sets is to allow non-Chrome browsers/clients to choose different behaviors.
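As a rough model of the behavior described above (hypothetical helper names; Chrome's actual implementation is of course more involved), a client enforcing the cap simply ignores everything past the fifth associated site:

```python
ASSOCIATED_SITE_LIMIT = 5  # the cap on associated sites described in this thread

def effective_associated_sites(declared_sites, limit=ASSOCIATED_SITE_LIMIT):
    """Model of the described behavior: sites beyond the limit are
    treated as if they were never part of the set."""
    return declared_sites[:limit]

# A 9-domain submission (like the yandex-team example above) is accepted
# on GitHub, but only the first 5 domains get auto-granting in this model.
declared = [f"https://site{i}.example" for i in range(1, 10)]
assert len(effective_associated_sites(declared)) == 5
```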

I’m trying to understand whether these publishers are purposefully doing this, or whether the lack of warning / error during submission has led them to believe their sets are appropriately sized.

This is good feedback. I think we could consider throwing a warning in this situation. Please consider filing this as a separate issue.

one of the associatedSites was this domain: turbopages[.]org/ which historically has had content on its homepage that maybe qualifies as a disclosure of the parent org (example @ https://web.archive.org/web/20240112201145/https://www.turbopages.org/)

Any chance you can post a screenshot from the archived page of the disclosure you are referring to? Both the archive and the actual site look identical to me based on what you've linked to.

What is the policy for organizations who drop the affiliation from these associatedSites? How long can an organization remove this before they are dropped from RWS lists? Is there any automated process to check for changes like this?

I don't have an immediate answer on this for you; but I can look into this and get back when we have an answer. We currently do not have an automated process to check for these changes, and I suspect that it would be very difficult to automate.

However, note that this measure of subjectivity with respect to representation of affiliation for associated sites, and the difficulty of running technical checks for it, is exactly why we have the 5-domain limit for associated sites. We believe this strikes the right balance between the privacy and user-understanding considerations and the scalability of the submission process. Affiliation representation on sites is also supplemented with browser UI that can help users discover RWS membership information.

How is “common ownership” determined for service sites? Is this merely through adding the appropriate JSON records within the /.well-known/ directory on a primary domain referencing each serviceSite, and on the serviceSite referencing the primary domain?

We currently do not verify ownership via our technical checks, but we do verify common web administrative control via the .well-known file checks.
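The administrative-control check can be sketched as follows (a hypothetical validator operating on already-fetched .well-known documents with made-up domains; the real submission-time checks are more involved): the primary's file must declare every member, and every member's file must point back at the primary.

```python
def verify_wellknown_links(primary, set_wellknown, member_wellknowns):
    """Check mutual .well-known references:
    - the primary's file must list exactly the declared members, and
    - every member's file must point back at the primary.
    `set_wellknown` is the JSON served by the primary;
    `member_wellknowns` maps member URL -> the JSON that member serves."""
    declared = set(set_wellknown.get("associatedSites", []) +
                   set_wellknown.get("serviceSites", []))
    if set(member_wellknowns) != declared:
        return False
    return all(doc.get("primary") == primary
               for doc in member_wellknowns.values())

primary = "https://primary.example"
set_doc = {"primary": primary,
           "associatedSites": ["https://assoc.example"],
           "serviceSites": ["https://service.example"]}
members = {"https://assoc.example": {"primary": primary},
           "https://service.example": {"primary": primary}}
assert verify_wellknown_links(primary, set_doc, members)

# A member pointing at a different primary fails the check.
members["https://service.example"] = {"primary": "https://other.example"}
assert not verify_wellknown_links(primary, set_doc, members)
```

Note that this kind of check only proves that whoever controls each site's web server cooperated; it says nothing about legal ownership, which matches the distinction drawn above.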

What stops the primary entity from slipping in a 2nd domain they don’t own into the serviceSite list, which is technically owned by one of their vendors, and merely asking the vendor to add a corresponding record on their RWS json file? This seems to be the obvious place most ad tech vendors will want to be listed so they get mini cross-site cookie pools within a publisher group. Is that understood / expected / allowed?

Our technical checks verify that sets are mutually exclusive from each other. So, even if an adtech vendor joined one publisher's set; they would be unable to join another publisher's set. I suspect this will make the solution unattractive to adtech vendors; unless you're suggesting that the vendor would create separate sites for each publisher group?

@thezedwards
Contributor Author

howdy @krgovind - thanks much for the time.

I'm a bit surprised about the 5-domain cap + organizations being allowed to submit more than 5 domains without a warning. What other browsers have considered using RWS? And with more than a 5-domain cap? Is the Yandex inclusion with more than 5 domains some indication Yandex is doing this too in their browser? Or any other public details here?

There appear to be zero organizations who have submitted multiple 5-domain groupings. There's a huge gap between what G is claiming should be done and what the industry is actually doing. I'm shocked that G thinks it's appropriate to arbitrarily cap the size of a publisher group without far more guidance on this process / error messages / industry publications, etc. If y'all want error messages when more than 5 domains are submitted, G will need to work on that; I don't have time to work on a new issue or pull request - sorry.

Why hasn't Google submitted RWS groups across their own sites like google.com / youtube.com / blogspot.com / etc.? Is Google so afraid of the PR backlash of "keeping 3rd party cookies on their own sites" that they can't even dogfood this? Y'all have way more than 5 domains, so this seems perfect for grouping and explaining how to ideally group domains. Or pick some other big publisher as an example - but please, give the community some live examples and not just hypotheticals if y'all are going to keep this 5-domain cap on publishers.

This entire 5-domain cap appears like it's going to make this not-usable for publisher groups, but very usable for ad tech intermediaries, which seems to be the opposite of the intent from this proposal.

I don't understand why the associatedSites don't have an owned & operated restriction and why the serviceSites don't have a restriction around 3rd party ad tech domains being integrated for ads use cases. Not having these policies, or even attempting to determine ownership beyond a well-known file, and not having clean ways to remove domains from these lists, basically ensures that RWS is going to get much worse over time.

Based on my understanding of where things are with RWS, it seems like at some point there will be access to unpartitioned cookies across an RWS set. So ad tech vendors will likely create unique domains, add them to RWS sets as serviceSites with publishers' approval, and then use the current methods from cookieless environments where they try to use the IP address as a join key to create partial graphs (described @ https://www.adweek.com/programmatic/some-cookieless-alternatives-still-use-cookies/), and use unrestricted unpartitioned 3rd party cookies to track users on the sites where they have this access. And if those intermediaries get access to hashed emails on those domains, then they have a persistent cross-site hashed-email join key AND an unpartitioned 3rd party cookie to use when the user isn't logged in on those RWS sites. This seems to be the obvious way ad tech intermediaries will want to use it -- the hashed emails only get them halfway, and then the unpartitioned cookies get them access to track when the user isn't logged in on those RWS sites.

I've seen no details about whether CNAME cloaking will be prevented if unpartitioned 3rd party cookies are green-lit. If an ad tech intermediary gets one of their domains added to an RWS list, what stops them from creating hundreds of subdomains CNAME-mapped to other 3rd party domains they control that are added to other RWS lists, basically exchanging 3rd party cookies between their own RWS domains via querystring passing or other methods, and then associating these userID cookie arrays to users based on IP addresses as the join key when they hit a new RWS site grouping? And then those ad tech intermediaries will clean up their graphs when they get hashed emails on any RWS publisher sites. This certainly won't be perfect, but tons of ad tech orgs over-stuff audience segments based on bad data, because it makes them money when they have userIDs filled with audience segments - so why wouldn't orgs continue to try this under RWS?

And to be clear, hashed emails are shared haphazardly currently, but if someone logs into site 1 in an RWS grouping, the ad tech intermediary can use the hashed email + 3rd party cookies to identify the user on site 2 in the RWS grouping without the person logging in. The site 2 visit will then have higher CPMs due to an identified user, and this feedback loop basically encourages ad tech intermediaries to get integrated into as many RWS groupings, and as many hashed-email login processes, as possible, via unique domains they control.

Due to this current structure, RWS isn't going to get focus from publisher groups, because it's close to worthless unless you absolutely need it and it's a ton of extra murky work to split your sites up into 5-domain chunks; meanwhile, ad tech intermediaries are just going to eventually slide up to publishers with "audience graph solutions" merely requiring RWS submissions - just like how tons of intermediaries manage ads.txt/app-ads.txt for publishers and commit massive mislabeling schemes and user-data privacy violations through that access. If publishers can't use this in a normal enterprise way, but ad tech intermediaries can get legacy benefits from it, then ad tech intermediaries will eventually be the folks primarily using RWS.

As it stands - without any technical checks besides the well-known lookups, no policy requirements around owned & operated sites in an RWS associatedSites list or public disclaimers required on those sites, and no clear processes to review or remove sites that are out of compliance - this RWS solution is going to empower odd investment schemes that "check the boxes" to allow for an RWS submission, and it will generate significant amounts of increased data sharing by ad tech intermediaries / ad tech consortiums focused on IP address data and hashed emails. And it appears there will be unpartitioned 3rd party cookie access across RWS sites, which pours diesel on these ad tech intermediary opportunities with hashed emails.

I'll also note that since there is a 5-associatedSites cap but no cap on serviceSites size, in theory this would support ad tech companies competing to get all their domains listed as serviceSites - but in practice, ad tech typically ends up doing stuff like the "Ad Tech Consortium", where specific domains are used and the data is then shared with participants of the co-op. My understanding of RWS currently seems to indicate that serviceSite IP-address user graphs will basically have different appending opportunities (hashed emails / RWS-list-unpartitioned-cookies), and this will basically empower sharing the IP / interest data with more and more vendors to try and improve their graphs.

Folks in ad tech know that IP address data isn't valuable for everyone, but if RWS lists also get someone access to avoid Gnatcatcher-style IP address obfuscation in the future, as well as the unpartitioned cookies across an RWS list, then it seems like the lack of policies is going to sprawl this RWS feature into a core way that some types of audiences are tracked across websites by audience pools.

I strongly believe G needs to consider actual policies around RWS that require owned & operated disclaimers on associatedSites, explicit purposes that are allowed for serviceSites, and explicit behaviors that get sites removed from the RWS list. I understand y'all can't respond to all of this - totally okay. Appreciate the time.

@krgovind
Collaborator

Thanks for the feedback, @thezedwards.

I just opened #339 to capture your suggestion to throw a warning on sets with greater than 5 associated sites. We are currently not aware of any browser supporting RWS with a larger limit; but at the time that we designed the checks, we wanted to leave open the possibility. Your point about developer confusion is noted though, so we will look into adding this. We currently have two sets that exceed the limit; and for now we have manually informed both submitters (1, 2) of this.

In terms of a single company creating more than one set, I am aware that Axel Springer has created two sets (1, 2). Additionally, regarding interest from publishers, my understanding is that roughly a third of the sets submitted right now are from publishers.

Regarding why the associated subset doesn't have an owned & operated restriction, we explained our reasoning in a blog post here.

FYI, we are currently also experimenting with IP Address Protection for Chrome users; and expect to limit tracking of users via IP addresses as part of that intervention.

For third-party cookie deprecation, we currently do not plan to intervene on CNAME usage, because the cookie/storage functionality possible via use of CNAMEs is equivalent to that offered by partitioned cookies/storage. I'm not sure I see any incremental privacy risk of CNAME'd service sites, considering that CNAMEs can be established to a subdomain on the primary site itself.
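The equivalence claimed above can be illustrated with a toy model of cookie partitioning (entirely hypothetical names and domains; real browser storage keys have more dimensions): a partitioned cookie is keyed by both the host that set it and the top-level site, so a vendor reachable via CNAME'd subdomains on several sites still ends up with one separate jar per top-level site rather than one shared jar.

```python
# Toy model: a partitioned cookie jar keyed by (setting_host, top_level_site).
jar = {}

def set_partitioned_cookie(setting_host, top_level_site, name, value):
    jar.setdefault((setting_host, top_level_site), {})[name] = value

def get_partitioned_cookie(setting_host, top_level_site, name):
    return jar.get((setting_host, top_level_site), {}).get(name)

# A vendor's CNAME'd subdomain sets a cookie under one top-level site;
# that cookie is invisible under any other top-level site.
set_partitioned_cookie("cdn.pub-a.example", "https://pub-a.example", "id", "abc")
assert get_partitioned_cookie("cdn.pub-a.example", "https://pub-a.example", "id") == "abc"
assert get_partitioned_cookie("cdn.pub-a.example", "https://pub-b.example", "id") is None
```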

Nonetheless, we appreciate your feedback and will keep this in mind as we continue to make incremental progress on our broader mission of improved privacy on the web.
