[Data liberation] wp_rewrite_urls() #1893

Open
wants to merge 18 commits into
base: trunk

adamziel
Collaborator

@adamziel adamziel commented Oct 14, 2024

Motivation for the change, related issues

A part of #1894.

Prototypes a wp_rewrite_urls() URL rewriter for block markup to migrate the content from, say, <a href="https://adamadam.blog"> to <a href="https://adamziel.com/blog">.

Status:

  • URL rewriting works to perhaps the greatest extent it ever did in WordPress migrations.
  • A few unit tests fail. Once we add 2000 tests, it is very likely that ~300 of them would fail.
  • The URL parser requires PHP 8.1. This is fine for some Playground applications, but we'll need broader compatibility to get any of this into WordPress core.
  • This PR features an outdated version of WP_HTML_Tag_Processor. Let's update it and find a way of not keeping a copy in this repo.

Details

This PR consists of code ported from https://github.com/adamziel/site-transfer-protocol. It uses a cascade of parsers to pierce through the structured data in a WordPress post and replace the URLs matching the requested domain.

The data flow is as follows:

Parse HTML -> Parse block comments -> Parse attributes JSON -> Parse URLs

On a high level, this parsing cascade is handled by the WP_Block_Markup_Url_Processor class:

$p = new WP_Block_Markup_Url_Processor( $block_markup, $base_url );
while ( $p->next_url() ) {
	// The URL matched at the current position, already parsed.
	$parsed_matched_url = $p->get_parsed_url();
	// ... compute $new_raw_url from $parsed_matched_url ...
	$p->set_raw_url( $new_raw_url );
}

Getting more into details, the WP_Block_Markup_Url_Processor extends the WP_HTML_Tag_Processor class and walks the block markup token by token. It then drills down into:

  • Text nodes – where it matches URLs using regexps. This part can be improved to avoid regular expressions.
  • Block comments – where it parses the block attributes and iterates through them, looking for ones that contain valid URLs
  • HTML tag attributes – where it looks for ones that are reserved for URLs (such as <a href="">) and contain valid URLs

The next_url() method moves through the stream of tokens, looking for the next match in one of the above contexts. The set_raw_url() method knows how to update each node type, e.g. updated block attributes are re-serialized with json_encode().
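To illustrate the block-comment branch only, here is a minimal sketch in plain PHP. It is not the code in this PR: `sketch_rewrite_block_comment_urls()` and the `old.example.com` / `new.example.org` hosts are hypothetical, it swaps only the host rather than mapping a full base URL, and it deliberately skips URLs with userinfo. It assumes the comment has the shape `<!-- wp:name {json} -->`.

```php
<?php
/**
 * Hypothetical helper (not part of this PR): rewrites the host of every
 * URL-valued attribute in a single block comment delimiter.
 */
function sketch_rewrite_block_comment_urls( string $comment, string $from_host, string $to_host ): string {
	// Split "<!-- wp:name {json} -->" into the block name and the attributes JSON.
	if ( ! preg_match( '/^<!--\s+wp:([a-z][a-z0-9_\/-]*)\s+(\{.*\})\s+-->$/s', $comment, $m ) ) {
		return $comment;
	}
	$attrs = json_decode( $m[2], true );
	if ( ! is_array( $attrs ) ) {
		return $comment;
	}
	// Visit every scalar attribute, including nested ones.
	array_walk_recursive(
		$attrs,
		function ( &$value ) use ( $from_host, $to_host ) {
			if ( ! is_string( $value ) ) {
				return;
			}
			$parts = parse_url( $value );
			if ( ! is_array( $parts ) || ! isset( $parts['scheme'], $parts['host'] ) ) {
				return;
			}
			// Compare whole hosts, case-insensitively, never substrings.
			if ( 0 !== strcasecmp( $parts['host'], $from_host ) ) {
				return;
			}
			// Only rewrite plain "scheme://host..." values; anything else is left alone.
			$prefix = $parts['scheme'] . '://' . $parts['host'];
			if ( 0 === strpos( $value, $prefix ) ) {
				$value = $parts['scheme'] . '://' . $to_host . substr( $value, strlen( $prefix ) );
			}
		}
	);
	// json_encode() escapes "/" as "\/" by default, as in the output shown in this PR.
	return '<!-- wp:' . $m[1] . ' ' . json_encode( $attrs ) . ' -->';
}

$in = '<!-- wp:image {"src": "https:\/\/old.example.com\/wp-content\/image.png", "id": 5} -->';
echo sketch_rewrite_block_comment_urls( $in, 'old.example.com', 'new.example.org' ), "\n";
```

Decoding and re-encoding the JSON, rather than doing a string replace on the comment, means escaped forms such as `\/` or `\u0073` are normalized before the URL is compared.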

Processing tricky inputs

When this markup is fed into the migrator:

<!-- wp:paragraph -->
<p>
	<!-- Inline URLs are migrated -->
	🚀-science.com/science has the best scientific articles on the internet! We're also
	available via the punycode URL:
	
	<!-- No problem handling HTML-encoded punycode URLs with urlencoded characters in the path -->
	&#104;ttps://xn---&#115;&#99;ience-7f85g.com/%73%63ience/.
	
	<!-- Correctly ignores similar–but–different URLs -->
	This isn't migrated: https://🚀-science.comcast/science <br>
	Or this: super-🚀-science.com/science
</p>
<!-- /wp:paragraph -->

<!-- Block attributes are migrated without any issue -->
<!-- wp:image {"src": "https:\/\/\ud83d\ude80-\u0073\u0063ience.com/%73%63ience/wp-content/image.png"} -->
<!-- As are URI HTML attributes -->
<img src="&#104;ttps://xn---&#115;&#99;ience-7f85g.com/science/wp-content/image.png">
<!-- /wp:image -->

<!-- Classes are not migrated. -->
<span class="https://🚀-science.com/science"></span>

The following output is actually produced:

<!-- wp:paragraph -->
<p>
	<!-- Inline URLs are migrated -->
	science.wordpress.com has the best scientific articles on the internet! We're also
	available via the punycode URL:
	
	<!-- No problem handling HTML-encoded punycode URLs with urlencoded characters in the path -->
	https://science.wordpress.com/.
	
	<!-- Correctly ignores similar–but–different URLs -->
	This isn't migrated: https://🚀-science.comcast/science <br>
	Or this: super-🚀-science.com/science
</p>
<!-- /wp:paragraph -->

<!-- Block attributes are migrated without any issue -->
<!-- wp:image {"src":"https:\/\/science.wordpress.com\/wp-content\/image.png"} -->
<!-- As are URI HTML attributes -->
<img src="https://science.wordpress.com/wp-content/image.png">
<!-- /wp:image -->

<!-- Classes are not migrated. -->
<span class="https://🚀-science.com/science"></span>
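The "similar but different" cases come down to comparing parsed URL components instead of substrings. The sketch below shows that idea in plain PHP under loud assumptions: `sketch_url_is_under_base()` and the `old.example.com` host are hypothetical, it requires an explicit scheme, and it does not convert punycode (that would need something like `idn_to_utf8()` from the intl extension). The PR's own parser handles more than this.

```php
<?php
/**
 * Hypothetical helper (not this PR's API): does $candidate live under $base?
 */
function sketch_url_is_under_base( string $candidate, string $base ): bool {
	// Undo HTML character references first, as found in attribute values and text nodes.
	$candidate = html_entity_decode( $candidate, ENT_QUOTES | ENT_HTML5, 'UTF-8' );

	$c = parse_url( $candidate );
	$b = parse_url( $base );
	if ( ! is_array( $c ) || ! is_array( $b ) || ! isset( $c['host'], $b['host'] ) ) {
		return false;
	}

	// Whole-host comparison: "example.comcast" and "super-example.com" must not match.
	if ( 0 !== strcasecmp( $c['host'], $b['host'] ) ) {
		return false;
	}

	// Percent-decode the paths and compare on a segment boundary,
	// so "/sciences" is not treated as living under "/science".
	$c_path = rtrim( rawurldecode( $c['path'] ?? '/' ), '/' ) . '/';
	$b_path = rtrim( rawurldecode( $b['path'] ?? '/' ), '/' ) . '/';

	return 0 === strpos( $c_path, $b_path );
}

$base = 'https://old.example.com/science';
var_dump( sketch_url_is_under_base( '&#104;ttps://old.example.com/%73%63ience/article', $base ) );
var_dump( sketch_url_is_under_base( 'https://old.example.comcast/science', $base ) );
```

A class attribute that happens to contain a URL never reaches a check like this, because only attributes reserved for URLs are inspected in the first place.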

Remaining work

  • Add PHPCBF
  • Get to zero CBF errors
  • Get the unit tests to run in CI (e.g. run composer install)
  • Review the API shape
  • Add hundreds of unit tests

Follow-up work

Testing Instructions (or ideally a Blueprint)

CI runs the PHP unit tests. To run them on your local machine:

cd packages/playground/data-liberation
composer install
cd ../../../
nx test:watch playground-data-liberation

@adamziel adamziel added the [Type] Enhancement New feature or request label Oct 14, 2024
@adamziel adamziel requested a review from a team as a code owner October 14, 2024 17:55
@adamziel adamziel changed the title [Data liberation] Prototype wp_rewrite_urls() [Data liberation] wp_rewrite_urls() Oct 14, 2024
Labels
[Type] Enhancement New feature or request
Projects
Status: Needs review