[Data liberation] wp_rewrite_urls() #1893

adamziel · 2024-10-14T17:55:23Z

Motivation for the change, related issues

A part of #1894.

Prototypes a wp_rewrite_urls() URL rewriter for block markup to migrate the content from, say, <a href="https://adamadam.blog"> to <a href="https://adamziel.com/blog">.

Status:

URL rewriting works, perhaps more thoroughly than it ever has in WordPress migrations.
A few unit tests fail. Once we add 2000 tests, it is very likely that ~300 of them will fail.
The URL parser requires PHP 8.1. This is fine for some Playground applications, but we'll need broader compatibility to get any of this into WordPress core.
This PR features an outdated version of WP_HTML_Tag_Processor. Let's update it and find a way of not keeping a copy in this repo.

Details

This PR consists of code ported from https://github.com/adamziel/site-transfer-protocol. It uses a cascade of parsers to pierce through the structured data in a WordPress post and replace the URLs matching the requested domain.

The data flow is as follows:

Parse HTML -> Parse block comments -> Parse attributes JSON -> Parse URLs

On a high level, this parsing cascade is handled by the WP_Block_Markup_Url_Processor class:

$p = new WP_Block_Markup_Url_Processor( $block_markup, $base_url );
while ( $p->next_url() ) {
	$parsed_matched_url = $p->get_parsed_url();
	// ... compute $new_raw_url from $parsed_matched_url
	$p->set_raw_url( $new_raw_url );
}

Getting more into details, the WP_Block_Markup_Url_Processor extends the WP_HTML_Tag_Processor class and walks the block markup token by token. It then drills down into:

Text nodes – where it matches URLs using regexps. This part can be improved to avoid regular expressions.
Block comments – where it parses the block attributes and iterates through them, looking for ones that contain valid URLs.
HTML tag attributes – where it looks at the ones reserved for URLs (such as <a href="">) and checks whether they contain valid URLs.
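To make the text-node case concrete, here is a minimal sketch of regexp-based candidate detection. The helper name sketch_find_url_candidates() and the pattern are assumptions made for illustration; they are not the pattern this PR uses. A real implementation would still need to validate each candidate with a URL parser.

```php
<?php
// Hypothetical helper, not part of this PR: finds URL-like candidates in an
// already-decoded text node and reports their byte offsets.
function sketch_find_url_candidates( string $text ): array {
	// Optional scheme, a dotted host ending in 2+ letters, then an optional path.
	// The pattern is deliberately loose: it yields candidates, not valid URLs.
	$pattern = '~(?:https?://)?[^\s/<>"\']+\.[a-z]{2,}(?:/[^\s<>"\']*)?~iu';
	preg_match_all( $pattern, $text, $matches, PREG_OFFSET_CAPTURE );

	return array_map(
		function ( $match ) {
			return array(
				'url'    => $match[0],
				'offset' => $match[1],
			);
		},
		$matches[0]
	);
}

$found = sketch_find_url_candidates( 'Visit https://adamadam.blog/about or adamadam.blog today' );
foreach ( $found as $candidate ) {
	echo $candidate['offset'], ': ', $candidate['url'], "\n";
}
```

Keeping the offsets is what would let a processor splice a replacement back into the text node without touching the surrounding markup.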

The next_url() method moves through the stream of tokens, looking for the next match in one of the above contexts. The set_raw_url() method knows how to update each node type, e.g. block attribute updates are json_encode()-d.
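The block-comment branch of the cascade can be sketched in isolation as follows. This is an illustration under stated assumptions, not the code in this PR: sketch_rewrite_block_comment() is a hypothetical helper, and PHP's built-in parse_url() stands in for the PHP 8.1 URL parser mentioned above.

```php
<?php
// Hypothetical helper, not part of this PR: rewrites URLs stored in the JSON
// attributes of a single block comment delimiter.
function sketch_rewrite_block_comment( string $comment, string $from_host, string $to_base ): string {
	// Step 1: split the delimiter into the block name and the attributes JSON.
	if ( ! preg_match( '/^<!--\s+wp:([a-z][a-z0-9_\/-]*)\s+(\{.*\})\s+-->$/s', $comment, $m ) ) {
		return $comment; // Not a block opener with attributes.
	}
	$attrs = json_decode( $m[2], true );
	if ( ! is_array( $attrs ) ) {
		return $comment; // Malformed JSON is left untouched.
	}

	// Step 2: visit every string attribute, including nested ones.
	array_walk_recursive(
		$attrs,
		function ( &$value ) use ( $from_host, $to_base ) {
			if ( ! is_string( $value ) ) {
				return;
			}
			// Step 3: parse the candidate and compare hosts exactly.
			$parts = parse_url( $value );
			if ( ! is_array( $parts ) || ! isset( $parts['host'] ) ) {
				return;
			}
			if ( strtolower( $parts['host'] ) !== strtolower( $from_host ) ) {
				return;
			}
			$value = rtrim( $to_base, '/' ) . ( $parts['path'] ?? '/' )
				. ( isset( $parts['query'] ) ? '?' . $parts['query'] : '' )
				. ( isset( $parts['fragment'] ) ? '#' . $parts['fragment'] : '' );
		}
	);

	// Step 4: re-encode the attributes; json_encode() escapes slashes by default.
	return '<!-- wp:' . $m[1] . ' ' . json_encode( $attrs ) . ' -->';
}

$in = '<!-- wp:image {"src": "https:\/\/adamadam.blog\/wp-content\/image.png"} -->';
echo sketch_rewrite_block_comment( $in, 'adamadam.blog', 'https://adamziel.com/blog' ), "\n";
```

Decoding the JSON first means escaped forms such as \/ or \u0073 are compared by value, and re-encoding the whole attribute object keeps the delimiter valid after the update.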

Processing tricky inputs

When this code is fed into the migrator:

<!-- wp:paragraph -->
<p>
	<!-- Inline URLs are migrated -->
	🚀-science.com/science has the best scientific articles on the internet! We're also
	available via the punycode URL:
	
	<!-- No problem handling HTML-encoded punycode URLs with urlencoded characters in the path -->
	&#104;ttps://xn---&#115;&#99;ience-7f85g.com/%73%63ience/.
	
	<!-- Correctly ignores similar–but–different URLs -->
	This isn't migrated: https://🚀-science.comcast/science <br>
	Or this: super-🚀-science.com/science
</p>
<!-- /wp:paragraph -->

<!-- Block attributes are migrated without any issue -->
<!-- wp:image {"src": "https:\/\/\ud83d\ude80-\u0073\u0063ience.com/%73%63ience/wp-content/image.png"} -->
<!-- As are URI HTML attributes -->
<img src="&#104;ttps://xn---&#115;&#99;ience-7f85g.com/science/wp-content/image.png">
<!-- /wp:image -->

<!-- Classes are not migrated. -->
<span class="https://🚀-science.com/science"></span>

This is the actual output:

<!-- wp:paragraph -->
<p>
	<!-- Inline URLs are migrated -->
	science.wordpress.com has the best scientific articles on the internet! We're also
	available via the punycode URL:
	
	<!-- No problem handling HTML-encoded punycode URLs with urlencoded characters in the path -->
	https://science.wordpress.com/.
	
	<!-- Correctly ignores similar–but–different URLs -->
	This isn't migrated: https://🚀-science.comcast/science <br>
	Or this: super-🚀-science.com/science
</p>
<!-- /wp:paragraph -->

<!-- Block attributes are migrated without any issue -->
<!-- wp:image {"src":"https:\/\/science.wordpress.com\/wp-content\/image.png"} -->
<!-- As are URI HTML attributes -->
<img src="https://science.wordpress.com/wp-content/image.png">
<!-- /wp:image -->

<!-- Classes are not migrated. -->
<span class="https://🚀-science.com/science"></span>
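The example above works because URLs are compared after decoding, not as raw substrings. A minimal sketch of that idea follows. It assumes a hypothetical helper, sketch_url_matches(), and uses PHP's built-in parse_url() instead of the parser in this PR. Converting between Unicode and punycode hostnames is out of scope here, so the sketch only compares ASCII hostnames.

```php
<?php
// Hypothetical helper, not part of this PR: decides whether a raw URL found in
// markup belongs to the site being migrated.
function sketch_url_matches( string $raw, string $host, string $path_prefix ): bool {
	// Text and attribute values arrive HTML-encoded, e.g. "&#104;ttps://".
	$decoded = html_entity_decode( $raw, ENT_QUOTES | ENT_HTML5, 'UTF-8' );
	$parts   = parse_url( $decoded );
	if ( ! is_array( $parts ) || ! isset( $parts['host'] ) ) {
		return false;
	}

	// Exact host comparison: "example.comcast" must not match "example.com".
	if ( strtolower( $parts['host'] ) !== strtolower( $host ) ) {
		return false;
	}

	// Percent-decode the path so that "%73%63ience" equals "science".
	$path   = rawurldecode( $parts['path'] ?? '/' );
	$prefix = rtrim( $path_prefix, '/' );

	// Match the prefix on a path segment boundary, so "/sciences" is rejected.
	return $path === $prefix || 0 === strpos( $path, $prefix . '/' );
}

var_dump( sketch_url_matches( '&#104;ttps://xn---&#115;&#99;ience-7f85g.com/%73%63ience/', 'xn---science-7f85g.com', '/science' ) );
var_dump( sketch_url_matches( 'https://xn---science-7f85g.comcast/science', 'xn---science-7f85g.com', '/science' ) );
```

The first call reports a match and the second does not, mirroring the "similar but different" cases in the markup above.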

Remaining work

Follow-up work

Patch WP_HTML_Tag_Processor in WordPress core; see the comment on "HTML API: Add set_modifiable_text() for replacing text nodes" (wordpress-develop#7007).
Package our copy of WP_HTML_Tag_Processor as a "WordPress polyfill" for standalone usage.
Make it compatible with PHP 7.2+

Testing Instructions (or ideally a Blueprint)

CI runs the PHP unit tests. To run them on your local machine:

cd packages/playground/data-liberation
composer install
cd ../../../
nx test:watch playground-data-liberation

adamziel added 5 commits October 14, 2024 18:43

Data liberation: Kickoff the project

1ef710f

Port the URL rewriters from adamziel/site-transfer-protocol

234a8bf

Port WP_HTML_Processor et al. from WordPress

819febd

Move WordPress core files

0a6167b

Outline the next steps

826fe75

adamziel added the [Type] Enhancement label Oct 14, 2024

adamziel requested a review from a team as a code owner October 14, 2024 17:55

adamziel added this to the Data Liberation: URL Rewriting milestone Oct 14, 2024

adamziel changed the title ~~[Data liberation] Prototype wp_rewrite_urls()~~ [Data liberation] wp_rewrite_urls() Oct 14, 2024

adamziel mentioned this pull request Oct 14, 2024

[Data Liberation] Tracking issue #1894

Open

adamziel added 13 commits October 14, 2024 23:08

Add PHPCS and CBF

0633e6f

Update HTML API, fix unit tests

4406fcf

Merge branch 'trunk' into data-liberation-bring-in-php-parsers

0cfd334

Bump CI PHP version to 8.1

b90a9d6

Adjust the CI setup for PHP

081535b

Run npm instlal insteaf of installing just nx

aca88fe

Use the correct nx project name

897af50

Remove the network functions and only lint the src directory

f7679b0

Remove special casing for direct matching pathname prefixes

5b9ec7d

Fix linting errors

97fed71

Move the additional functions to pbpcbf.php

96c1ce4

Replace iterate_urls with url_matches

e15408a

Lint PHP

b788eea
